1. Introduction


1.1  Background 

A  future  vision  of  the  use  of  autonomous  and  intelligent  robots  in  dismounted 
military  operations  is  for  Soldiers  to  interact  with  robots  as  teammates,  much  like 
Soldiers  interact  with  other  Soldiers  (Brown  2011;  Lilley  2013;  Phillips  et  al.  2013; 
Redden  et  al.  2013).  Soldiers  will  no  longer  be  operators  in  full  control  of  every 
movement,  as  the  autonomous  intelligent  systems  will  have  the  capability  to  act 
without  continual  human  input.  However,  Soldiers  will  need  to  use  the  information 
available  from  or  provided  by  the  robot.  One  of  the  critical  needs  to  achieve  this 
vision  is  the  ability  of  Soldiers  and  robots  to  communicate  with  each  other.  This 
report  examines  one  mode  of  communication — gesture. 

Part  of  the  future  vision  includes  bidirectional  communication,  with  Soldiers  and 
robots  communicating  with  each  other.  However,  this  review  focuses  on  human 
gestures  to  instruct  and  command  robots.  Therefore,  we  are  describing  (for  the  most 
part)  one-way  communications  from  human  to  robot.  While  this  is  very  important, 
it  is  only  one  part  of  the  larger  vision  for  humans  and  intelligent,  autonomous 
systems  to  interact  with  each  other.  Many  efforts  are  focused  on  the  use  of  gestures 
for  robot  control;  in  this  report,  we  discuss  technology  options  and  issues  impacting 
effectiveness  for  robot  control  in  military  operations. 

The  use  of  gestures  as  a  natural  means  of  interacting  with  devices  is  a  very  broad 
concept  that  includes  a  range  of  body  movements,  including  movements  of  the 
hands,  arms,  and  legs,  facial  expressions,  eye  movements,  head  movements,  and/or 
2-dimensional  (2-D)  swiping  gestures  against  flat  surfaces  such  as  touch  screens 
(Saffer  2008).  Gesture-based  technology  is  already  in  place  and  commonly  used 
(e.g.,  public  buildings,  public  restrooms)  without  special  instruction  required  for 
effective  use.  A  common  example  of  a  well-designed  gestural  command  is  the  use 
of  hands  to  “wave”  to  activate  (e.g.,  public  bathroom  faucet).  This  concept  is  also 
common  to  gaming  interfaces  (e.g.,  Kinect)  and  is  now  extending  to  other  private 
and  public  domains  such  as  automobile  consoles  (Stokloso  2015). 

The  most  cursory  examination  of  the  gestural  literature  shows  great  breadth  and 
depth,  with  investigations  of  gestures  arising  from  a  variety  of  fields.  Classification 
and  interpretation  of  gestures  have  been  discussed  and  reviewed  over  many  years 
(Pavlovic  et  al.  1997;  Moore  et  al.  1999;  Kendon  2004).  Gestures  using  hand 
motion  are  most  common,  and  one  useful  classification  scheme  is  by  purpose:  a) 
conversational,  b)  communicative,  c)  manipulative,  and  d)  controlling  (Wu  and 
Huang 1999). Conversational gestures are those used to enhance verbal
communications, while communicative gestures, such as sign language, comprise
the  language  itself.  Manipulative  gestures  can  be  used  for  remote  manipulation  of 
devices  or  in  virtual  reality  settings  to  interact  with  virtual  objects.  Controlling 
gestures can be used in both virtual and real-world settings, and are distinguished in
that  they  direct  objects  through  gestures  such  as  static  arm/hand  postures  and/or 
dynamic  movements. 

Of  particular  interest  in  this  report  are  applications  and  advancements  with  regard 
to  controlling  gestures  for  human-robot  interactions.  There  have  been  many  studies 
showing  usefulness  of  more  naturalistic  interfaces  for  robot  control  (Goodrich  and 
Schultz  2007).  The  study  of  gestures  for  robot  control  in  itself  is  a  huge  field  of 
endeavor.  Gestures  can  be  as  simple  as  a  static  hand  posture  or  may  involve 
coordinating movements of the entire body (Yang et al. 2006). Gesture-based
commands  to  robots  have  been  used  in  a  variety  of  different  settings,  such  as 
assisting  users  with  special  needs  (Jung  et  al.  2010),  assisting  in  grocery  stores 
(Corradini  and  Gross  2000),  and  home  assistance  (Muto  et  al.  2009).  Examples  of 
gestural  commands  in  these  settings  include  “follow  me”,  “go  there”,  or  “hand  me 
that”.  There  are  also  advancements  in  various  industrial  settings  to  control  robotic 
assembly  and  maneuver  tasks  (Lambrecht  et  al.  2011;  Barattini  et  al.  2012). 

Some  aspects  of  gesture  control  are  not  included  in  this  report.  Stroke  gestures 
made  upon  a  screen  (e.g.,  tablet,  smartphone)  represent  a  different  domain  of 
gesture  control,  which  is  also  of  interest  to  military  human-robot  interactions 
(O’Brien  et  al.  2009).  Research  regarding  these  stroke  gestures  attends  to  issues 
that  define,  develop,  and  validate  approaches  and  taxonomies  relating  to  the  stroke 
gesture. This report does not address these issues, but rather is focused on free-form
gestures  made  by  the  hand  and  arm,  technology  approaches  to  recognition  of  these, 
and  how  they  may  impact  effectiveness  within  a  military  human-robot  application. 
There  is  also  interesting  work  to  develop  gestures  for  the  robot  to  use  to 
communicate  back  to  the  operator,  such  as  head  nodding  (Muto  et  al.  2009;  St  Clair 
et  al.  2011b),  conversational  gestures  (Bremner  et  al.  2009),  and  queries  (Iba  et  al. 
2003).  While  bidirectional  human-robot  interaction  is  also  pertinent  to  military 
settings,  this  realm  of  research  deserves  a  separate  review  and  is  not  addressed  here. 

1.2  Purpose 

For  this  report,  we  start  with  a  very  general  nontechnical  overview  of  gestural 
controls  of  robots  in  general,  then  we  narrow  our  focus  to  efforts  more  specific  to 
control  of  military  ground  robots  by  dismounted  Soldiers.  We  start  with  findings 
that  are  more  general  because  certain  features  of  gestural  commands  to  robots  are 
relevant to all settings: they should be intuitive, familiar, and easily distinguished.
Findings from nonmilitary user groups can still generalize to military
use,  given  the  cross-cultural  intuitive  nature  of  some  gestures  (e.g.,  pointing).  That 
is,  many  gestures  can  be  intuitively  recognized  and  used  across  population  groups. 
Technical  challenges  for  gesture-based  control  are  also  similar,  regardless  of 
operational  setting.  We  review  current  progress  and  issues  with  a  view  of  assessing 
different  technological  approaches  for  their  relevance  to  dismounted  Soldier 
systems.  The  technology  approaches  are  very  different  and  can  strongly  moderate 
effectiveness  for  different  situations. 

We  cover  2  very  different  approaches  to  gestural  control:  camera-based  systems 
and  wearable  instrumented  systems.  For  the  field  of  gestural  controls,  the 
technological  progress  is  rapid  and  distributed  among  many  different  approaches 
within  each  general  domain.  In  this  review,  we  have  attempted  to  include  papers 
that  represent  a  cross  section  of  relevant  approaches,  by  universities,  government, 
industry,  and  countries,  of  varying  disciplines  and  points  of  view.  Our  aim  is  to 
identify  the  major  approaches  and  corresponding  issues  across  the  diverse  range  of 
gesture  control  endeavors.  From  this,  we  discuss  characteristics  of  different 
gesture-based  systems,  with  regard  to  capabilities,  advantages,  and  limitations — as 
they  pertain  to  use  by  dismounted  Soldiers. 

1.3  Gestural  Control  Task  Demands 

A  front-end  issue  when  considering  use  of  any  new  system  is  consideration  of  the 
task  demands  and  situational  constraints.  Here,  we  identify  5  types  of  tasks  that  can 
impact  the  choice  of  technological  approach:  simple  commands,  complex 
commands,  pointing  commands,  remote  manipulation,  and  robot-user  dialogue. 
The  developer  should  always  start  with  a  deep  understanding  of  operational  and 
situational  task  demands.  In  the  following  subsections  we  describe  some  basic  tasks 
that  distinguish  the  kind  of  technology  and  approach  that  will  best  meet  the  user 
requirements. 

1.3.1  Simple  Commands  (Small  Set  of  Specific  Commands/Alerts) 

While  it  is  easy  to  think  of  single  commands  (e.g.,  “stop”,  “move  forward”,  “turn 
left”)  as  simple  commands,  one  should  keep  in  mind  it  is  not  the  command  per  se, 
but the distinguishability and the intuitive nature of the gesture that determine ease
of  use  and  recognition.  When  the  gesture  set  is  small,  recognition  rates  have  been 
high,  across  many  different  camera  and  glove-based  approaches.  Gestures  must  be 
distinguishable  from  one  another  and,  in  general,  gestures  using  movements  of  the 
arms as well as fingers have been more effectively recognized.

1.3.2 Complex Commands

Complex  commands  are  characterized  by  higher  demands  for  deliberative  cognitive 
processing,  often  through  use  of  a  larger  gesture  set  and/or  combinations  of  gestural 
units  to  communicate  multiple  concepts.  Any  particular  gesture  set  lies  on  a 
continuum  of  complexity  (e.g.,  recognition  difficulty),  ranging  from  a  small  set  of 
elemental  gestures  (e.g.,  “move  forward”,  “turn  left”,  “turn  right”,  “stop”)  to 
gesture sets that may include large numbers of gestures, associated not only with
meaning  (i.e.,  vocabulary),  but  also  elements  of  syntax  and  grammar,  depending  on 
sequence  (e.g.,  American  Sign  Language  [ASL]). 

As  the  gesture  set  becomes  larger,  the  challenge  to  effective  recognition  also 
increases.  It  is  critical  to  have  gestures  that  are  distinguishable  from  the  point  of 
view  of  the  recognition  process;  however,  it  is  also  desirable  to  have  gestures  that 
are  intuitive  to  the  command  and  easily  learned  (e.g.,  pointing).  There  is  no  distinct 
categorization  of  what  comprises  a  small  versus  large  gesture  set,  for  the  difficulty 
of  recognition  is  also  a  function  of  distinguishability.  It  should  simply  be  noted  that 
these  distinctions  exist  on  a  continuum,  and  this  continuum  of  gesture  set 
complexity  will  greatly  moderate  the  results  with  regard  to  recognition 
effectiveness.  This  issue  is  very  much  related  to  concepts  of  visual  salience 
(St  Clair  et  al.  2011b)  and  tactile  salience  (Mortimer  et  al.  2011). 
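
One way to make distinguishability concrete is to look at which gestures a recognizer mistakes for one another. The following is a minimal sketch, assuming a confusion matrix has already been collected for a candidate gesture set; the function name, the gesture labels, and the counts are hypothetical illustrations, not data from any study cited here.

```python
def most_confused_pair(labels, confusion):
    """Find the least distinguishable pair of gestures.

    confusion[i][j] is the number of trials in which gesture i was
    recognized as gesture j. Returns (label_a, label_b, rate), where rate
    is the share of trials of either gesture that was mistaken for the other.
    """
    worst = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            trials = sum(confusion[i]) + sum(confusion[j])
            errors = confusion[i][j] + confusion[j][i]
            rate = errors / trials if trials else 0.0
            if worst is None or rate > worst[2]:
                worst = (labels[i], labels[j], rate)
    return worst


# Hypothetical counts for a 4-gesture set (50 trials per gesture).
LABELS = ["stop", "move forward", "turn left", "turn right"]
CONFUSION = [
    [48, 2, 0, 0],
    [1, 49, 0, 0],
    [0, 0, 42, 8],
    [0, 0, 9, 41],
]
```

A pair with a high rate is a candidate for redesign or removal, which matters more as the gesture set grows.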

Scenario  complexity  is  also  increased  through  an  increase  in  scenario  actors  (human 
and  robot).  Given  a  multi-entity  situation,  the  goal  is  to  attain  robot  awareness  of 
the  environment  and  actors,  as  interpreted  through  shared  goals  and  perspective 
taking.  This  level  of  situation  assessment  and  understanding  by  the  robot  is  sought 
through  development  and  application  of  agent-based  and  other  computational  decision 
models or cognitive architectures (St Clair et al. 2011a).

1.3.3  Pointing  Commands 

A  fundamental  task  for  gesture  commands  is  that  of  directing  movement  for  ground 
robots.  Pointing  gestures  have  been  developed  over  several  years,  either  to  convey 
direction  information  or  to  clarify  ambiguous  speech-based  commands. 
Perzanowski distinguished natural gestures (e.g., pointing a finger or whole arm)
from  “synthetic”  gestures  (e.g.,  use  of  a  mechanical  device  such  as  mouse  or  stylus). 
The  act  of  pointing  is  considered  universally  intuitive,  as  exemplified  by  attempts 
by  infants  to  use  pointing  when  trying  to  grasp  objects  out  of  their  reach 
(Perzanowski  et  al.  2000b). 

While  the  pointing  gesture  is  natural  and  intuitive,  recognition  of  “where”  and 
“what”  can  be  challenging,  depending  on  task  context.  When  using  a  pointing 
gesture, there should be a distinction between pointing to a specific object (e.g.,
“please bring me that cellphone”), pointing to a specific location (e.g., “through
that  alley”)  and  pointing  to  a  generalized  area  (e.g.,  “search  that  area”).  Areas  can 
be  more  precisely  circumscribed  when  augmented  by  use  of  a  map  display  (e.g., 
circling  the  area  of  interest)  (Brooks  and  Breazeal  2006;  Perzanowski  et  al.  2000a, 
2000b).  Other  approaches  have  used  object  recognition  as  an  aid  to  gesture 
interpretation  (e.g.,  “bring  me  that  cup”).  A  different  approach  was  taken  by  Strobel 
et  al.  (2002),  where  they  used  the  spatial  context  of  the  environment  (e.g., 
orientation  of  object  to  axis  of  hand)  to  help  disambiguate  hand  gestures. 
Advancements  in  instrumented  glove  technology  are  enabling  determination  of 
azimuth from a pointing gesture; when combined with a GPS-based wearable device,
both  direction  and  distance  can  be  determined  through  sensors  within  the  glove 
(Vice  2015). 
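
To illustrate how an azimuth and a distance could be turned into a target location, the sketch below projects a point from the operator's GPS position along a bearing. It assumes a spherical Earth and uses the standard great-circle destination formula; the function name and parameters are hypothetical, and this is not a description of the glove system reported by Vice (2015).

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def project_target(lat_deg, lon_deg, azimuth_deg, distance_m):
    """Return (lat, lon) in degrees of the point reached from the operator's
    position by travelling distance_m along azimuth_deg (clockwise from north).

    Longitude is not wrapped to [-180, 180]; for the short ranges of a
    pointing gesture this only matters very close to the antimeridian.
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    bearing = math.radians(azimuth_deg)
    ang = distance_m / EARTH_RADIUS_M  # angular distance in radians

    lat2 = math.asin(
        math.sin(lat1) * math.cos(ang)
        + math.cos(lat1) * math.sin(ang) * math.cos(bearing)
    )
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(ang) * math.cos(lat1),
        math.cos(ang) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)
```

For example, `project_target(lat, lon, 90.0, 150.0)` gives the coordinates of a point 150 m due east of the operator, which could then be passed to the robot as a waypoint.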

Littmann  et  al.  (1996)  discuss  progress  on  a  point  and  place  task  for  a  robot  capable 
of  grasping  maneuvers  (e.g.,  “pick  up  that  object  and  place  it  over  there”)  using  2 
color cameras that provide stereo data. The output is further processed by
color-sensitive neural networks (NNs) to determine object location. In a laboratory
setting, the system localized objects to within 1 cm in a workspace
area of 50 x 50 cm. The 2 cameras must show both the human hand and the table
with  objects.  In  addition,  each  training  exemplar  must  show  a  hand-pointing  posture 
associated  with  a  known  location. 

Abidi  et  al.  (2013)  used  a  Kinect  sensor  to  interpret  pointing  gestures  that  directed 
a  robot  to  move  to  a  location  that  the  user  pointed  to  on  the  ground.  The  3-D  location 
of  a  person’s  right  elbow,  right  hand,  and  eye  gaze  were  captured  to  determine 
location.  Two  approaches  were  compared:  gesture  alone  and  gesture  combined  with 
eye  gaze.  While  participants  expressed  a  strong  preference  (i.e.,  62%  vs.  38%)  for 
the combination of gesture with eye gaze, no performance data were reported with
regard  to  accuracy  of  localization  by  the  robot.  An  advantage  to  this  approach  is 
that  the  robot  continually  assesses  the  pointing  gesture  while  moving,  allowing  for 
real-time  change;  however,  the  robot  must  always  keep  the  operator  in  sight.  This 
prevents  pointing  to  a  location  that  is  outside  this  range. 
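
The geometry behind pointing at a ground location can be sketched as a ray-plane intersection. The example below is a simplified illustration that extends the line from elbow to hand until it meets a flat floor; it assumes 3-D joint positions in metres with z pointing up, ignores eye gaze, and is not the method used by Abidi et al. (2013). The function name is hypothetical.

```python
def pointed_ground_location(elbow, hand, ground_z=0.0):
    """Extend the elbow-to-hand ray until it meets the plane z = ground_z.

    elbow and hand are (x, y, z) tuples in metres. Returns the (x, y)
    ground point, or None if the arm is level or pointing upward, or if
    the hand is already below the plane.
    """
    dx, dy, dz = (h - e for h, e in zip(hand, elbow))
    if dz >= 0:
        return None  # the ray never descends to the ground
    t = (ground_z - hand[2]) / dz  # ray parameter measured from the hand
    if t < 0:
        return None  # hand is below the ground plane
    return hand[0] + t * dx, hand[1] + t * dy
```

Small errors in the estimated joint positions are magnified as the arm approaches horizontal, which is one reason distant targets are hard to localize from gesture alone.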

1.3.4  Remote  Manipulation 

Ground-based  mobile  robots  are  often  used  for  remote  manipulation  of  objects.  In 
combat  situations,  this  capability  is  often  used  for  bomb  disposal  (Axe  2008). 
Several  efforts  have  been  reported  where  gestures  have  been  developed  for  remote 
manipulation.  Several  of  these  regard  the  development  of  service  robots  designed 
to  assist  people  in  locations  such  as  offices,  supermarkets,  hospitals,  and 
households. Other efforts focus on assisting users in more dangerous environments
such as hazardous areas or space, using telepresence and teleoperation (see Basanez
and  Suarez  [2009]  for  a  review  of  teleoperation  issues). 

One  primary  manipulation  common  to  most  applications  is  that  of  grasping. 
Grasping consists of several steps: a) perception of object, b) determination of
object form, size, orientation, and position, c) planning the grip, d) grasping the
object,  e)  moving  the  object  to  a  new  location,  and  f)  releasing  the  object  (Becker  et 
al.  1999).  Becker  and  his  associates  developed  a  camera-based  system  that  allows 
interpretation  of  gestures  that  indicate  the  choice  of  object,  the  grip  to  be  used,  and 
the  desired  final  position.  The  task  environment  was  restricted  to  that  of  a  tabletop. 
The  camera  was  a  dual-stereo  camera  with  3  degrees  of  freedom  (DOF)  (pan,  tilt, 
and  vergence)  with  both  color  and  monochrome  capabilities.  Recognition  software 
was  based  on  the  C++  library  FLAVOR  (“Flexible  Library  for  Active  Vision  and 
Object Recognition”) developed at the Institut für Neuroinformatik (Bochum,
Germany). Tracking of the operator’s hand relied on motion detection, skin color
analysis,  and  a  stereo  cue.  Object  recognition  was  based  on  object  edges. 

Colasanto  et  al.  (2013)  describe  issues  and  challenges  inherent  in  developing 
successful  teleoperation  of  anthropomorphic  robotic  hand-arm  systems.  They 
support  the  approach  where  the  motion  is  made  with  the  human  hand  and  captured, 
to  enable  proper  imitation  by  the  robotic  hand.  While  visual-based  systems  have 
been  used  for  grasping  tasks,  they  usually  require  special  markers  on  the  hand  and 
still  have  problems  due  to  visual  occlusion.  They  describe  algorithmic  approaches 
that  used  an  instrumented  wired  glove  (i.e.,  Cyberglove  having  22  joint-angle 
measurements)  and  a  4-finger  robotic  hand.  They  describe  development  of  an 
effective  approach  combining  joint-space  mapping  and  fuzzy-based  pose  mapping 
that accommodated both grasping movements and more free-form movements.

Stiefelhagen  and  associates  (2004)  reported  the  development  of  a  system  using 
speech,  gestures,  and  head  pose  to  accomplish  grasping  tasks  such  as  retrieving 
objects,  turning  a  lamp  on/off,  or  setting  a  table.  While  they  report  correct 
interpretation  of  each  component  (e.g.,  speech,  head  orientation,  pointing  gesture) 
and  of  the  combination  of  components,  they  did  not  report  performance  data  on  the 
manipulation  tasks  per  se.  Lii  and  his  associates  (2012)  describe  uses  of  a  library 
of  tasks  linked  to  gesture  commands  given  by  an  operator  wearing  a  Cyberglove. 
They describe teleoperation of grasps and manipulation with a 15-DOF robot hand
and 6-DOF object manipulation, using different grasp approaches.

Camera-based  approaches  have  also  been  used  for  remote  manipulation.  Lathuiliere 
and  Herve  (2000)  reported  successful  real-time  application  using  a  single  camera, 
where  the  operator  wore  a  dark  glove  marked  with  colored  cues.  This  approach  was 
used for 2 different grasping maneuvers (i.e., spherical and cylindrical). Raheja et
al. (2010) also used a camera-based approach for robotic hand control. Using a
technique described by Raheja et al. (2009) to manage illumination and
viewing angle, the researchers developed a system to recognize hand gestures.
Eleven  different  gestures  were  used  to  control  a  robotic  hand.  The  gestures  were 
restricted  to  static  posturing  of  the  hand  (e.g.,  clenching  a  fist,  spreading  of  fingers, 
thumbs  up/down,  2-finger  “peace”  sign,  1  finger,  3  fingers).  Recognition  was 
reported  at  90%  but  was  very  dependent  on  proper  light  illumination.  Celik  and 
Kuntalp  (2012)  also  used  a  camera-based  approach  for  control  of  grip  commands 
and compared 2 methods for gesture recognition. The template-matching
algorithm,  based  on  pre-stored  gestural  information,  was  found  to  be  faster  than  the 
signature  signal  algorithm  (based  on  identification  of  edge  data)  but  had  higher 
memory requirements. Their results appear to be based on 2 commands (open
gripper, close gripper); therefore, these differences may compound as the number
of gestures increases.
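
The general idea of template matching can be sketched as comparing an observed feature vector against pre-stored templates and picking the closest one. The example below uses zero-mean normalized correlation on small binary silhouettes; the templates, threshold, and function names are hypothetical, and the sketch is not the algorithm evaluated by Celik and Kuntalp (2012).

```python
import math


def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equal-length feature vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat vector carries no shape information
    return sum(x * y for x, y in zip(da, db)) / denom


def match_gesture(features, templates, threshold=0.6):
    """Return the label of the best-matching template, or None if no
    template correlates with the observation at or above threshold."""
    best_label, best_score = None, threshold
    for label, template in templates.items():
        score = normalized_correlation(features, template)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical 3 x 3 binary silhouettes, flattened row by row.
TEMPLATES = {
    "open gripper": [0, 1, 0, 1, 1, 1, 0, 1, 0],
    "close gripper": [0, 0, 0, 0, 1, 0, 0, 0, 0],
}
```

Every stored template is compared against each observation, so both memory use and matching time grow with the number of gestures, consistent with the concern about larger gesture sets.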

Researchers  from  Keio  University  in  Yokohama,  Japan,  investigated  user 
preferences  for  gestural  control  of  a  robotic  “hand”  for  assistive  tasks  such  as 
assembly  (Wongphati  et  al.  2012).  Control  movements  included  6  direction 
movements  and  open-close  gripper  movements  that  were  applied  to  assist  users 
with  a  soldering  task.  While  many  gestures  are  developed  from  the  point  of  view 
of  engineering  (e.g.,  ease  of  detection),  in  this  study  the  focus  was  on  capturing 
gestures  that  were  more  intuitive  to  the  user.  While  results  are  preliminary  and 
exploratory,  trends  indicate  user  preference  to  use  one  or  both  hands,  as  opposed  to 
other  body  parts/movements. 

An  interesting  application  reported  by  Park  et  al.  (2005b)  is  that  of  a  “competition” 
robot  that  could  make  fighting  maneuvers  (stand  up,  hook,  turn  left,  turn  right,  walk 
forward,  back  pedal,  side  attack,  move  to  left,  move  to  right,  back  up,  both  punch, 
left  punch,  right  punch).  Pose  extraction  was  accomplished  through  skin  color 
regions  using  a  2-D  Gaussian  model  that  extracted  positions  of  the  face,  left  hand, 
and  right  hand.  Hidden  Markov  modeling  (HMM)  was  used  to  process  the 
continuous stream. Each gesture was performed 100 times by 10 different
individuals.  Reliability  of  recognition  was  reported  to  be  98.95%. 
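
To show how HMMs can classify a stream of observations, the sketch below scores a sequence against one discrete HMM per gesture using the forward algorithm and picks the most likely model. The two toy models, their probabilities, and the observation symbols are hypothetical and far simpler than the system of Park et al. (2005b).

```python
def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm: probability of the observation sequence under a
    discrete HMM with start[i], trans[i][j], and emit[i][symbol].

    No scaling is applied, so very long sequences would underflow.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model that best explains the observations."""
    return max(models, key=lambda name: sequence_likelihood(obs, *models[name]))


# Hypothetical symbols: 0 = hand low, 1 = hand high.
# Each model is a 2-state left-to-right HMM: (start, trans, emit).
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "raise hand": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "lower hand": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}
```

In practice the model parameters would be learned from repeated recordings of each gesture rather than set by hand.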

1.3.5  Robot-User  Dialog 

As  robots  become  more  autonomous,  command  of  the  robot  transitions  from  direct 
and  detailed  teleoperation  commands  to  higher-level  command  dialogs.  Several 
approaches  have  been  taken  to  establish  dialogs  that  allow  robots  to  acknowledge 
commands  and  ask  for  clarification.  At  the  least,  robots  have  to  inform  users  of 
their  status,  through  visual,  auditory,  or  other  means.  International  standards  exist 
for visual signals of danger and for auditory alarms. However, there is increasing
need for more complex dialog to attain clarity (e.g., “you told me to go somewhere
but you did not say where”) (Kennedy et al. 2007; Perzanowski et al. 2000a,
2000b). Many efforts are currently focused on developing the ability for the robot
not only to maintain a dialog but also to learn new interactions over time, such as use
of gaze direction or pointing gestures toward an object of interest (Ou and Grupen
2009). Taylor and associates report progress toward richer dialog based on Soar, a
cognitive architecture for artificial intelligence (Taylor et al. 2012).
Soar  provides  a  robust  framework  for  knowledge  representation  and  logic  (i.e., 
“if-then”  rules).  These  efforts  represent  growing  focus  and  progress  with  regard  to 
communications  based  on  logic  and  reason. 
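
The flavor of rule-based clarification can be conveyed with a few ordered “if-then” rules. The sketch below is plain Python, not Soar syntax, and the command fields and replies are hypothetical; it only illustrates how an incomplete command might trigger a clarifying question.

```python
# Each rule is a (condition, reply) pair; the first matching rule fires.
RULES = [
    (lambda c: c.get("action") is None,
     lambda c: "Please repeat the command."),
    (lambda c: c.get("action") == "go" and c.get("location") is None,
     lambda c: "You told me to go somewhere but you did not say where."),
    (lambda c: c.get("action") == "go",
     lambda c: f"Moving to {c['location']}."),
]


def respond(command):
    """Apply the rules in order to a parsed command (a dict) and return
    the robot's reply; fall back to a plain acknowledgment."""
    for condition, reply in RULES:
        if condition(command):
            return reply(command)
    return "Command acknowledged."
```

A pointing gesture recognized after the clarifying question could supply the missing `location` field, which is where gesture and dialog meet.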

In  addition  to  enhanced  clarity  and  complexity  of  rational  commands,  there  are  also 
many  efforts  focused  on  social  aspects  of  human-robot  communications.  Muto  et 
al.  (2009)  explored  options  for  human-robot  dialogue  for  situations  involving  a 
social  robot  to  assist  elderly  users.  Findings  from  an  investigation  of  human-human 
dialogue  were  applied  to  develop  and  refine  the  robot’s  response  to  a  user  request, 
such  as  “could  you  give  me  a  wood  block”.  Human-human  interactions  were 
observed,  where  the  request  utterance  speed  was  varied.  The  average  pause  between 
request  and  response  was  300  ms.  In  subsequent  human-robot  interactions,  speed 
of  the  request  was  varied,  along  with  the  robot’s  response,  using  speech  and  gesture 
(i.e.,  head  nod,  grasping),  with  regard  to  timing,  order,  and  length  of  pause  after  the 
request.  Afterward,  these  interactions  were  rated  by  the  human  participant.  They 
found  that  older  participants  favored  the  condition  where  the  robot  response  timing 
matched  the  timing  of  the  request  utterance  (e.g.,  fast,  slow).  This  preference  was 
not  demonstrated  by  younger  participants.  The  use  of  the  older  versus  younger 
participants  was  exploratory;  there  were  no  a  priori  hypotheses  with  regard  to 
expected  differences.  However,  researchers  suggested  differences  might  be  in  part 
due to differences in subjective time perception. They may also be due to younger
participants  having  higher  levels  of  familiarity  with  technology  and  robots. 
Findings  suggest  the  need  for  more  research  in  this  area. 

While  much  of  the  progress  in  human-robot  dialog  is  based  on  speech,  there  are 
complementary  efforts  to  incorporate  gestures  to  enhance,  clarify,  and/or  replace 
speech,  when  appropriate.  St  Clair  and  his  colleagues  (2011b)  investigated  human- 
robot  effectiveness  in  visually  cluttered  environments  to  better  understand  the 
effect  of  visual  saliency  on  gesture  production  by  a  robot.  In  this  case  the  gestures 
of  interest  were  those  indicating  a  distal  location  or  item  to  elicit  clarification  from 
the  operator.  Pointing  gestures  produced  by  the  robot  included  orientation  of  the 
head  (e.g.,  “eye”  gaze  direction),  with  straight  arms,  with  bent  arms,  and  the 
combination  of  arm  and  head  gestures.  Target  items  were  varied  with  regard  to 
visual salience, distance, angle, and pointing modality. Results showed that
pointing modality affected results to a greater extent than visual salience did.
The combination of head orientation and arm gesture was more accurate than either
gesture alone.

Taylor  and  his  associates  (2014),  in  their  development  of  gestures  for  a  robotic 
mule, offer a useful taxonomy for the classification of gesture types, which
includes  the  following: 

•  Static  versus  dynamic:  Whether  the  gesture  is  a  “pose”  or  requires 
movement. 

•  Continuous  versus  discrete:  Whether  the  gesture  can  be  repeated  and  still 
be  recognized. 

•  1-arm versus 2-arm: Some are made with one arm, others with two.

•  Inclusion  of  hand  pose:  Whether  the  gesture  is  arm-only  or  also  includes  a 
hand  position. 

•  Inclusion of body reference: Whether the gesture is hand-arm only, or refers
to  another  part  of  the  body  (e.g.,  tapping  the  head). 

•  Facing  toward  or  away  from  target  of  communication:  Some  gesture 
recognition  systems  require  facing  toward  the  receiver. 

•  Gestures  in  x-y  plane  versus  x-y-z  plane:  Some  gestures  may  have  more 
complex  movement  in  3-D  space. 

•  Unique  versus  dependent:  Whether  gesture  has  only  one  meaning  or  is 
dependent  on  context. 

These  distinctions  become  more  useful  and  relevant  as  other  issues  are  also 
considered  (e.g.,  purpose  of  gesture,  type  of  recognition  system). 
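
The taxonomy above maps naturally onto a simple record type, which makes it easy to filter a gesture set against the constraints of a given recognition system. The sketch below is one possible encoding; the field names, the example gestures, and their attribute values are hypothetical and are not taken from Taylor et al. (2014).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GestureType:
    name: str
    dynamic: bool                   # False = static pose, True = requires movement
    continuous: bool                # True = can be repeated and still be recognized
    two_arm: bool                   # False = 1-arm, True = 2-arm
    uses_hand_pose: bool            # False = arm-only
    uses_body_reference: bool       # e.g., tapping the head
    requires_facing_receiver: bool  # must face the target of communication
    three_dimensional: bool         # False = movement in the x-y plane only
    context_dependent: bool         # False = unique meaning


def usable_facing_away(gestures):
    """Names of gestures that do not require facing the receiver."""
    return [g.name for g in gestures if not g.requires_facing_receiver]


# Hypothetical entries for illustration only.
GESTURES = [
    GestureType("halt", False, False, False, True, False, True, False, False),
    GestureType("rally", True, True, False, False, True, False, True, False),
]
```

A similar filter on `uses_hand_pose` or `three_dimensional` could screen gestures for a sensor that cannot resolve fingers or depth.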

1.3.6  Summary 

Designers  and  managers  must  give  careful  consideration  to  operational  task 
demands  to  develop  or  select  the  appropriate  system  for  diverse  commands,  ranging 
from  simple  commands  to  more  complex  human-robot  dialogs.  The  need  for  remote 
manipulation  or  accurate  communication  of  azimuth  and  distance  (i.e.,  pointing) 
introduces  additional  challenges  and  considerations.  Both  camera-based  and 
instrumented  systems  vary  in  effectiveness  for  the  different  types  of  task  demands. 
All  varieties  have  demonstrated  effectiveness  for  simple  commands.  All  gesture 
systems are more challenged when accomplishing pointing gestures, and may need
augmentation such as a visual display, speech, or a handheld pointer to better
localize the target. Instrumented gloves can have an advantage over camera-based
systems in accomplishing more complex and/or manipulation commands, particularly
in  situations  having  high  visual  clutter  or  greater  distances  between  the  operator 
and  robot.  Future  concepts  include  more  bidirectional  dialogs  between  the  operator 
and  robot.  In  addition,  operational  context,  such  as  visibility,  weather,  and  need  for 
colocation  or  direct  view  of  the  robot,  will  also  impact  the  effectiveness  of  a  given 
gesture  system. 

2.  Camera-Based  Systems 


2.1  Overview  and  Options 

Camera-based  gesture  recognition  systems  have  used  a  variety  of  camera  types  to 
capture  images  that  are  then  interpreted  with  some  form  of  quantitative  algorithmic 
interpretation of the video (Hassanpour et al. 2008). This type of approach is also
referred to as vision-based recognition (Gavrila 1999; Wu and Huang 1999).
Applications  have  ranged  from  a  small  set  of  body  postures  to  large  sets  of  complex 
hand  gestures  (e.g.,  ASL).  A  well-known  application  is  that  of  the  Kinect  system, 
where  camera-based  interpretation  of  user  body  posture  and  movements  serves  to 
control  videogame  features  and  feedback. 

It  should  first  be  noted  that  camera-based  systems  vary  widely — with  regard  to  type 
of  camera  sensor,  number  of  cameras,  and  various  algorithms  used  for  gesture 
recognition.  More  recent  advancements  are  as  follows. 

2.1.1 3-D Data Recognition

More recent improvements to camera-based recognition systems strive to better
capture  3-D  movement.  Progress  in  this  area  has  been  achieved  over  time.  Early 
attempts  were  fairly  bulky  (Kortenkamp  et  al.  1996).  The  3-D  camera  sensors  offer 
an  alternative  that  may  alleviate  some  of  the  problems  with  2-D  camera  displays. 
Vogler  and  Metaxas  (2001)  developed  parallel  HMMs  to  recognize  ASL  gestures 
using  a  3-D  camera  system.  Xie  et  al.  (2010)  describe  a  stereo  vision-based 
approach  to  identify  the  trajectory  of  a  dynamic  gesture,  reporting  a  recognition 
rate  of  92%.  Gordon  et  al.  (2008)  used  2  cameras  to  achieve  stereo  vision  to  attain 
depth  information.  The  3-D  approach  to  hand  postures  needs  a  large  enough  data 
set  to  cover  the  large  number  of  possible  hand  shapes  from  various  viewing  angles. 
While  successes  have  been  reported  with  regard  to  3-D-based  real-time  recognition 
of  human  shapes  and  movements,  results  are  limited,  restricted,  and  more  relevant 
to  surveillance  as  opposed  to  robot  control  per  se  (Checka  and  Demirdjian  2010; 
Cohen 2013).

The 3-D sensors with additional depth perception data (e.g., Kinect) offer even
more  detailed  data  for  interpretation.  Use  of  depth  perception  data  allows  the  user 
more  range  of  motion  that  can  be  interpreted  by  the  system.  Several  studies  have 
used  the  Kinect  depth  sensor  system  (Cheng  et  al.  2012).  Yanik  et  al.  (2012)  used 
Kinect  technology  to  build  recognition  of  3  hand  signals  taken  from  ASL  to 
command  assistive  robots  (i.e.,  “come  closer”,  “go  away”,  “stop”).  Their  approach 
used  the  Growing  Neural  Gas  (GNG)  algorithm  that  is  more  robust  to  variations  in 
gesture  execution.  Skeletal  depth  data  collected  by  the  Microsoft  Kinect  sensor  was 
clustered  with  the  GNG.  They  report  initial  progress  on  their  goal  to  develop  a 
system  that  will  improve  with  user  feedback.  Lai  et  al.  (2012)  reported  2  approaches 
to  gesture  recognition  of  8  hand  gestures,  both  using  Kinect  depth  data  and  both 
resulting  in  over  99%  accuracy  of  real-time  recognition.  While  both  approaches 
were  accurate,  the  one  based  on  Euclidean  distance  was  associated  with  more 
limitations.  Barattini  et  al.  (2012)  developed  a  gesture  set  for  control  of  industrial 
robots, based on distinguishability data (i.e., confusion matrix), as determined
through use of the Microsoft Kinect system and dynamic time warping (DTW)
recognition  algorithms.  Others  have  also  applied  the  Kinect  capabilities  for 
industrial  robots  (Lambrecht  et  al.  2011)  and  for  humanoid  robots  (Suay  and 
Chernova  2011).  Masum  et  al.  (2012)  applied  Kinect  capabilities  for  whole  body 
gesture  recognition,  for  the  naturalistic  control  of  a  humanoid  robot.  They  reported 
99.87%  accuracy  using  the  Fuzzy  Neural  Generalized  Learning  Vector 
Quantization  algorithm.  Konda  et  al.  (2012)  used  depth  perception  data  to  better 
adapt to outdoor conditions.
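
The distinguishability analysis attributed above to Barattini et al. (2012) can be illustrated with a short sketch. The gesture labels, the counts, and the helper names `most_confused_pair` and `per_gesture_accuracy` are hypothetical and are not taken from the cited work. The sketch only shows how off-diagonal counts in a confusion matrix point to the gesture pairs that a recognizer mixes up, which are candidates for removal or redesign when a gesture set is assembled.

```python
def most_confused_pair(labels, confusion):
    """Return (mixups, label_a, label_b) for the most frequently confused pair.

    confusion[i][j] is the number of times gesture i was recognized as gesture j.
    """
    best = (-1, None, None)
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            # Count confusions in both directions for the pair (i, j).
            mixups = confusion[i][j] + confusion[j][i]
            if mixups > best[0]:
                best = (mixups, labels[i], labels[j])
    return best


def per_gesture_accuracy(labels, confusion):
    """Fraction of trials of each gesture that were recognized correctly."""
    return {labels[i]: confusion[i][i] / sum(confusion[i])
            for i in range(len(labels))}


# Hypothetical counts for illustration only (20 trials per gesture).
LABELS = ["stop", "come", "go"]
CONFUSION = [
    [18, 1, 1],   # "stop" performed
    [0, 15, 5],   # "come" performed
    [1, 4, 15],   # "go" performed
]

print(most_confused_pair(LABELS, CONFUSION))
print(per_gesture_accuracy(LABELS, CONFUSION))
```

In this made-up example, "come" and "go" account for most of the errors, so one of the two would be the first candidate for replacement with a more distinct gesture.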

Xu  et  al.  (2012)  described  the  development  of  a  gesture  control  system  for  robot 
control,  using  Kinect  depth  perception  data  for  the  control  of  a  human  service  robot. 
HMMs,  using  left-right  topology  to  identify  hand  trajectory,  are  used  to  model  and 
classify 3-D tracking of 2 hands to distinguish 1) 6 navigation commands given with
the right hand (turn right, turn left, move forward, move backward, rotate, and brake)
and 2) use of the left hand for velocity (robot speed) control. The system was able
to recognize gestures by both hands at the same time. The start and end of a gesture were
determined through a detection method combining an initial region with hand
velocity. While real-time performance was assessed only in
laboratory settings, the system was stated as robust to changes in illumination, color
variations,  and  background  clutter.  The  system  was  evaluated  using  gestures 
generated  by  6  people  that  each  performed  60  gestures.  Gesture  recognition  rates 
ranged  from  96%  to  100%  for  the  right-hand  gesture,  and  100%  with  the  left  hand. 
These  rates  were  compared  to  recognition  rates  based  on  skin  color  segmentation 
(Elmezain  et  al.  2008)  and  DTW  (Wu  et  al.  2010).  While  statistical  significance  is 
not  reported,  recognition  rates  were  similar  or  higher  using  the  depth  data  approach, 
and lower using the dynamic time warping approach. Recognition time was stated
to be based on time to complete a gesture, about 0.5 s. Navigation accuracy was
described through diagrams representing ground truth (i.e., the path that the robot
was supposed to traverse) and actual performance. Diagrams were quite similar in
pattern; however, no performance data were reported with regard to navigation errors.
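
The start and end detection described for Xu et al. (2012), which combines an initial region with hand velocity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spot_gestures`, the 2-D hand positions, the rectangular start region, and the speed threshold are all hypothetical. A gesture is taken to begin when the hand leaves the start region faster than the threshold and to end when the hand speed falls below it.

```python
import math


def spot_gestures(points, dt, start_region, speed_threshold):
    """Segment a stream of hand positions into gestures.

    points: list of (x, y) hand positions sampled every dt seconds.
    start_region: (xmin, ymin, xmax, ymax) where a gesture may begin.
    Returns a list of (start_index, end_index) pairs.
    """
    xmin, ymin, xmax, ymax = start_region
    segments = []
    start = None
    for i in range(1, len(points)):
        (x0, y0), (x1, y1) = points[i - 1], points[i]
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        moving = speed >= speed_threshold
        if start is None:
            # A gesture can only begin from inside the start region.
            in_region = xmin <= x0 <= xmax and ymin <= y0 <= ymax
            if moving and in_region:
                start = i - 1
        elif not moving:
            # The hand has come to rest: close the current gesture.
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(points) - 1))
    return segments


# Hypothetical track: rest, move right for three samples, rest.
track = [(0, 0), (0, 0), (1, 0), (2, 0), (3, 0), (3, 0), (3, 0)]
print(spot_gestures(track, dt=1.0, start_region=(-0.5, -0.5, 0.5, 0.5),
                    speed_threshold=0.5))
```

Movement that begins outside the start region is ignored, which is one simple way to keep incidental arm motion from being passed to the classifier.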

Li  and  Pan  (2012)  used  Kinect  technology  (e.g.,  skeleton  tracing)  and  a  DTW 
recognition algorithm for real-time control of a ground robot using 11 arm-based
gestures  (e.g.,  left  hand  to  right;  right  hand  to  left,  2  hands  zoom  in,  2  hands  clap, 
etc.).  Using  5  people,  with  each  person  demonstrating  each  gesture  20  times,  they 
reported  accuracies  from  90%  to  96%.  They  stated  response  times  were  short  but 
did not report actual times. They also reported that the user was able to use real-time
returned video to manipulate the robot to avoid obstacles and successfully
reach  its  destination. 
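
DTW, as used by Li and Pan (2012) and others cited in this report, aligns two motion sequences that differ in speed before measuring how far apart they are. The following is a minimal sketch of the textbook algorithm with nearest-template classification. The template names and 2-D trajectories are hypothetical, and the cited systems operated on Kinect skeleton data rather than on toy coordinates like these.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of points."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(sample, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(sample, templates[name]))


# Hypothetical hand trajectories used as gesture templates.
TEMPLATES = {
    "swipe_right": [(0, 0), (1, 0), (2, 0), (3, 0)],
    "raise_hand": [(0, 0), (0, 1), (0, 2), (0, 3)],
}

# The same swipe performed more slowly (repeated samples).
slow_swipe = [(0, 0), (0, 0), (1, 0), (1, 0), (2, 0), (3, 0), (3, 0)]
print(classify(slow_swipe, TEMPLATES))
```

Because the warping path may repeat samples, the slower performance matches its template with zero cost, which is the property that makes DTW tolerant of variation in gesture speed.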

Sugiyama  and  Miura  (2013)  reported  an  approach  that  integrates  a  head-mounted 
camera  with  a  9-axis  orientation  sensor  and  hand-worn  inertial  sensors.  Sensor 
information is fused to estimate walking motion and hand motions, which then
drive the movement (walking and arm motion) of a humanoid robot. The camera
detects the hand through skin color and finger edges and can detect movement and
grasping  gestures.  The  user  was  able  to  successfully  control  robot  movements.  The 
primary  application  was  that  of  remote  telepresence  control.  Gopalakrishnan  et  al. 
(2005)  discussed  how  camera-based  systems  could  be  augmented  by  integration 
with  laser-based  localization,  a  visual  map  display,  and  speech  recognition 
capabilities. 

2.1.2  Multicamera  Networks 

Multicamera  networks  show  much  promise  with  regard  to  recognition  of  a  variety 
of  types  of  gestures  in  a  complex  environment  (Aghajan  and  Wu  2007);  however, 
feasibility  of  such  networked  systems  in  operational  environments  is  limited.  One 
application that has reported success with this approach used multiple cameras to
control a robotic forklift, such that an object can be recognized and, after a period
of time, reacquired, even after the robot has moved long distances. First, the
operator gives the robot a “guided tour” of named objects and their locations. Then
the  operator  can  dispatch  the  robot  to  fetch  a  particular  object  by  name  (Walter  et 
al.  2010). 

Park  et  al.  (2005b)  used  a  multicamera  system  to  enable  2  robot  controllers  to 
participate  in  an  indoor  robot  competition  that  involved  fighting  movements 
between  2  small  humanoid-like  robots  (e.g.,  similar  to  transformer  toys).  This 
camera-based  approach  used  recognition  of  skin  color  regions  (i.e.,  face,  left  and 
right hands) of each operator for pose extraction and HMM-based gesture
recognition. Each of the 13 gestures was performed 100 times by 10 different
operators, resulting in an overall reliability of 98.9%.


2.1.3  Recognition  Algorithms 

Several reviews agree that the main technological challenge of camera-based
systems rests with the efficacy of the feature capture and gesture recognition
processes (e.g., Rautaray and Agrawal 2012). In another recent
review,  Khan  and  Ibraheem  (2012)  describe  these  phases  as  extraction  method  (e.g., 
image  capture),  features  extraction,  and  classification  (e.g.,  recognition).  Image 
capture  approaches  include  2-D  monocular  systems,  3-D  stereo  systems,  3-D 
systems with more advanced depth information (e.g., Kinect), and multiple-camera
systems.  While  capabilities  among  these  systems  vary,  they  each  face  similar 
challenges  and  assumptions,  such  as  the  need  for  high-contrast  backgrounds  under 
optimal  ambient  lighting  conditions.  An  appearance-based  approach  uses  a  simpler 
2-D  model;  still,  extracting  the  image  of  hand  movements  in  a  cluttered 
environment  can  be  challenging.  Murthy  and  Jadon  (2009)  also  provided  a  review 
of  issues  related  to  vision-based  gesture  recognition,  noting  that  such  systems  are 
more  effective  in  controlled  environments.  Problems  arise  from  less-than-optimal 
illumination  and  visual  noise  (Kang  et  al.  2009). 

Camera-based  gesture  recognition  relies  on  some  type  of  algorithmic  approach  to 
extract, parse, and classify gestures. A challenge common to all approaches is gesture
spotting, which is the extraction of the start and stop points of specific gestures
from a continuous stream of dynamic gestural movement (Alon et al. 2005). At this
time, all approaches are associated with some level of error. Recognition of
dynamic gestures has often been accomplished with some variation of HMM.
An HMM is a statistical model that builds upon Markov models and can
support inferences from data that are probabilistic and sequential, as in speech
recognition (Baker 1975). Many refinements and alternatives have been explored and applied, including finite
state machines (FSMs), artificial neural networks (ANNs), continuous dynamic
programming (Alon et al. 2005), and DTW (Kobayashi and Haruyama 1997; Shah
and  Jain  1997;  Wu  and  Huang  1999;  Corradini  and  Gross  2000;  Urban  et  al.  2004; 
Li  and  Pan  2012).  Rao  and  Mahanta  (2006),  instead  of  analyzing  all  frames  of  a 
video  feed,  applied  methodology  to  extract  and  analyze  a  subset  of  key  frames. 
They  also  used  a  clustering  algorithm  on  static  gestures  and  reported  success  on 
5,000 gestures, with recognition rates from 84% to 100%. Similarly, Chian et al.
(2008)  reported  a  method  to  more  efficiently  parse  gesture  classifications. 
Corradini  and  Gross  (2000)  compared  different  algorithms  with  regard  to 
recognition rates for a set of 5 basic gestures. These included a combination of NNs
with HMM, a combination of radial basis functions (RBFs) with HMM, and a recurrent
NN. While the HMM/RBF approach had somewhat higher recognition rates, the
data were insufficient to draw conclusions other than that all approaches
achieved fairly high recognition rates for a small set of gestures
under laboratory conditions.
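
HMM-based classification of the kind described above is commonly done by training one model per gesture and selecting the model that assigns the highest likelihood to the observed sequence. The sketch below uses the standard forward algorithm with a 2-state left-right topology (a state can repeat or advance but never go back). All parameters, the 2-symbol observation alphabet ("L" for hand on the left, "R" for hand on the right), and the model names are hypothetical values chosen for illustration and are not drawn from any cited study.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) computed with the forward algorithm.

    start[s]: initial probability of state s.
    trans[p][s]: probability of moving from state p to state s.
    emit[s][symbol]: probability that state s emits the symbol.
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


# Left-right topology: stay in a state or advance, never return.
LEFT_RIGHT = [[0.5, 0.5],
              [0.0, 1.0]]

# Hypothetical per-gesture models.
MODELS = {
    "swipe_right": {"start": [1.0, 0.0], "trans": LEFT_RIGHT,
                    "emit": [{"L": 0.9, "R": 0.1}, {"L": 0.1, "R": 0.9}]},
    "swipe_left": {"start": [1.0, 0.0], "trans": LEFT_RIGHT,
                   "emit": [{"L": 0.1, "R": 0.9}, {"L": 0.9, "R": 0.1}]},
}


def classify(obs):
    """Pick the gesture whose model best explains the observations."""
    return max(MODELS, key=lambda name: forward_likelihood(obs, **MODELS[name]))


observed = ["L", "L", "R", "R"]
print(classify(observed))
```

A practical system would use continuous features and work in log probabilities to avoid numerical underflow on long sequences; both are omitted here to keep the sketch short.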

Most  approaches  to  camera-based  recognition  rely  on  preprogrammed  algorithms 
based  on  extensive  repetition  of  a  small  set  of  gestures.  However,  there  are  attempts 
to  make  the  recognition  process  more  dynamic  and  user  friendly.  Hashiyama  et  al. 
(2006)  used  camera-based  information  in  such  a  way  as  to  allow  users  to  create 
their  own  gestures  for  specific  commands.  Users  “show”  the  gesture  to  the  robot 
system,  which  then  learns  to  recognize  the  gesture,  for  a  particular  command.  Users 
were  able  to  accomplish  this  within  30  min. 

To  summarize,  many  different  recognition  algorithms  and  strategies  are  currently 
being  investigated.  No  one  particular  approach  has  been  established  as  best;  instead, 
it  is  likely  that  ideal  solutions  will  be  situational,  depending  on  factors  such  as 
situation  context  (e.g.,  indoors,  lighting),  number  of  gestures,  and  type  of  gestures 
(e.g.,  range  of  movement).  In  the  following  section  we  discuss  some  factors 
contributing  to  recognition  effectiveness. 

2.2  Military  Applications:  Camera  Systems 

Camera-based  gesture  control  systems  have  been  investigated  for  several  military 
applications,  including  robot  control.  Other  applications  are  also  included  in  this 
section,  as  issues  regarding  use  and  effectiveness  generalize  among  the  various  task 
demands. 

2.2.1  Dismount  Soldier  Communications 

Cohen  (2000,  2005)  developed  and  demonstrated  a  prototype  gesture  recognition 
system  for  dismounted  Soldiers  based  on  camera-based  gesture  recognition  of 
existing  Army  hand  and  arm  signals  (e.g.,  Army  Field  Manual  21-60 
[Headquarters,  Department  of  the  Army  1987]).  The  approach  for  this  technology 
was  developed  and  demonstrated  previously  (Beach  and  Cohen  2001).  The  purpose 
was to prove the capability for technology-based recognition of a
subset of Army hand and arm signals performed by Soldiers. The capability was developed to enhance
training of gesture-based communications. The system would monitor whether the proper
gesture was performed adequately and, if not, show how to perform the required
gesture. A further goal was for the system to learn new gestures. Dynamic
gestures included “slow down”, “prepare to move”, and “attention”. Static gestures
included “stop”, “right/left turn”, “okay”, and “freeze”. In this case, the dynamic
gestures included the actual movement of the arms as part of the gesture. It is also
possible for camera-based systems to recognize static gestures, where a single,
static position conveys the gesture meaning and movement is not
part of the gesture. They also used a variety of other gestures, based
on circles and lines, to check for recognition performance. Recognition rates varied
from 80% to 100%, with many gestures recognized at 95% to 100% accuracy. This
camera-based  approach  to  recognition  of  gestures,  motion  tracking,  and  feature 
matching  has  been  applied  to  numerous  surveillance  applications  (Cohen  2013). 

2.2.2  Robot  Control 

Perzanowski and his associates (Perzanowski et al. 1998, 2000a, 2000b, 2002,
2003) reported progress toward a multimodal approach to control of single and
multiple robots using gesture, personal digital assistant (PDA), and speech.
Syntactic and semantic information is drawn using ViaVoice speech recognition
and the Nautilus natural language understanding system (Wauchope 1994). Visual
cues include body location, eye gaze, or other types of body language. This is
supplemented by recognition of gestures, such as pointing. In the 2002
instantiation, gesture recognition was based on a camera with a structured light
rangefinder mounted to the side of the robot to track the user’s hands, while sonar
sensors were used to detect objects in the environment. In 2003, a Wizard of Oz
(WoZ)  experiment  was  conducted  to  explore  naturally  occurring  preferences  with 
regard  to  the  use  of  gestures  and  language  syntax.  Natural  language  and  spatial 
relationships  are  based  on  an  approach  described  by  Skubic  (Skubic  et  al.  2001a, 
2001b,  2002,  2004).  Skubic  and  her  research  associates  explored  linguistic  spatial 
relationships  and  directives  (e.g.,  “go  around  the  desk  and  through  the  doorway”) 
when  referring  to  an  evidence  grid  map.  The  evidence  grid  map  is  built  from  range 
sensor  data,  where  objects  are  assigned  labels  provided  by  the  user.  Robot  software 
includes  spatial  reasoning  that  enables  the  robot  to  understand  linguistic 
descriptions  (e.g.,  “object  is  behind  the  desk”)  or  commands  (e.g.,  “go  through  the 
doorway  then  stop”). 

Sofge  et  al.  (2003a)  described  an  agent-based  approach  to  control  an  autonomous 
robot  using  natural  language,  spatial  reasoning,  and  gesture  interpretation.  The 
gestural  system  included  a  structured-light  rangefinder  that  emitted  a  horizontal 
plane  of  laser  light  and  a  camera  mounted  on  the  robot  with  an  optical  filter  tuned 
to  the  laser  frequency.  The  camera  generated  a  depth  map  from  the  reflection  of  the 
laser  off  objects  in  the  room.  Hand  gestures  were  recognized  because  they  were 
closer  to  the  camera  than  other  objects  and  were  then  processed  to  generate 
trajectories  that  are  used  for  gesture  recognition.  Gestures  indicated  direction,  and 
were  integrated  with  speech  commands  such  as  “go  over  there”.  The  focus  of  this 
effort, sponsored by the Defense Advanced Research Projects Agency (DARPA), was
on the use of agent-based capabilities to integrate screen-based, visual, and gestural
commands along with object recognition and spatial reasoning.

Approved for public release; distribution is unlimited.
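
The depth-based hand detection described for Sofge et al. (2003a), in which hands are found because they are closer to the camera than other objects, can be sketched as a nearest-object threshold on a depth map. This is a simplified illustration under stated assumptions: the function name `nearest_blob_centroid`, the small depth maps, and the margin value are hypothetical, and the cited system derived depth from a single plane of laser light rather than from a full image.

```python
def nearest_blob_centroid(depth, margin):
    """Locate the object nearest the camera in a depth map.

    depth: 2-D list of distances in meters; None where there is no return.
    margin: pixels within this distance of the nearest return are kept.
    Returns the (row, col) centroid of the kept pixels, or None.
    """
    valid = [d for row in depth for d in row if d is not None]
    if not valid:
        return None
    cutoff = min(valid) + margin
    hits = [(r, c)
            for r, row in enumerate(depth)
            for c, d in enumerate(row)
            if d is not None and d <= cutoff]
    row_mean = sum(r for r, _ in hits) / len(hits)
    col_mean = sum(c for _, c in hits) / len(hits)
    return (row_mean, col_mean)


# Hypothetical frames: a hand about 1 m away in front of a wall at 3 m.
frame_1 = [[3.0, 3.0, 3.0],
           [1.0, 1.1, 3.0],
           [3.0, 3.0, None]]
frame_2 = [[3.0, 3.0, 3.0],
           [3.0, 1.0, 1.1],
           [3.0, 3.0, None]]

# Centroids over successive frames form the trajectory passed to recognition.
trajectory = [nearest_blob_centroid(f, margin=0.2) for f in (frame_1, frame_2)]
print(trajectory)
```

The approach assumes the gesturing hand is the nearest object in view, which is also why it degrades when other people or objects come between the operator and the robot.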


Brooks  (2005)  and  Brooks  and  Breazeal  (2006)  reported  progress  toward 
naturalistic  interaction  with  robots  for  Soldier  tasks,  which  included  robot 
capability to learn and imitate tasks, camera-based recognition of social interaction
cues  (e.g.,  gaze  direction,  nodding,  facial  expressions,  etc.),  and  codified  verbal 
expressions.  In  this  case,  the  robot  is  not  controlled  directly  by  gestures;  instead, 
the  robot  visual  attention  system  attempts  to  monitor  and  recognize  gestures  and 
facial  expressions  of  the  operator  to  ascertain  stimuli  of  interest.  Robot  capabilities 
included  detection  of  head  orientation,  body  motion  mimicry,  hand  reflex,  figure- 
ground  segmentation,  and  response  to  operator  gestures  and  touch.  In  addition, 
gesture  recognition  is  integrated  with  a  representational  language  for  humanoid 
movement,  with  the  goal  of  mimicry.  Operational  goals  include  scenarios  where 
the  user  can  show  the  robot  what  to  do  (e.g.,  open  a  gas  tank,  stack  boxes,  etc.). 

Kennedy  et  al.  (2007),  using  ViaVoice,  programmed  speech  and  gesture  commands 
for  a  core  set  of  robot  control  commands  relevant  to  Marine  reconnaissance 
missions:  “attention”,  “stop”,  “assemble”  (i.e.,  come  here),  “as  you  were”  (i.e., 
continue), and “report” (which assumes the robot can communicate to the user). The
first  4  commands  were  taken  from  the  US  Marine  Corps  Rifle  Squad  manual 
(Headquarters,  Department  of  the  Navy  2002).  The  robot  interacted  with  a  team 
member  through  voice,  gestures,  and  movement,  using  artificial  intelligence 
programming  (i.e.,  ACT-R)  capability  with  additional  capacity  for  spatial  reasoning 
and  perspective  taking.  Together,  the  team  member  and  robot  were  tasked  to 
covertly  follow  and  approach  a  moving  target  (e.g.,  another  human  or  robot).  The 
target  continually  moved  to  various  locations  and  had  a  limited  field  of  view  in 
which  to  spot  any  followers.  The  robot  had  logical  reasoning  to  enable  covert 
approach,  such  as  “if  the  target  is  on  the  north,  east,  south,  or  west  side  of  an  object, 
it  should  hide  on  the  opposite  side  of  the  object”.  While  the  emphasis  of  this  effort 
was  on  spatial  reasoning,  gestures  and  speech  were  used  for  intercommunications. 
However, it is not clear from the report how many user-to-robot
communications were used or successfully interpreted.

Jones  (2007)  reported  development  of  a  prototype  robotic  system  capable  of 
detecting  and  following  a  person  through  indoor  and  outdoor  environments  while 
responding  to  voice  and  gesture  commands.  A  camera-based  system  was  mounted 
to  the  arm  of  a  Packbot,  placing  the  camera  about  5  ft  above  the  ground,  allowing 
the  view  of  a  person’s  upper  body  and  gestures.  The  sequence  of  observed  arm 
poses  was  matched  to  a  complete  sequence  corresponding  to  a  known  gesture  (e.g., 
wait, follow, breach doorway) through HMM algorithms (Rabiner 1989).

Ruttum and Parikh (2010) reported development of a gestural robot control system
using  core  hand  and  arm  signals  used  by  the  Marines.  They  focused  on  4  signals, 
“come”,  “go  here/move  up”,  “stop”,  and  “freeze”  and  identified  distinguishing 
factors  from  arm,  hand,  and  body  orientation/velocity/acceleration.  Their  approach 
was  camera-based  with  real-time  analysis  of  continuous  video  based  on  a  Vicon 
motion  detection  system.  This  system  is  stated  to  reduce  the  limitations  with  regard 
to  background  clutter  and  orientation  of  the  person  to  the  camera.  The  assessment 
was  conducted  indoors  (in  an  area  measuring  30  x  25  x  9  ft).  The  subject  was  tagged 
with  markers  that  are  detected  by  infrared  cameras  set  up  within  the  room.  They 
also  compared  2  methods  of  analysis:  Bayesian  and  NNs.  Preliminary  results  were 
reported  as  inconclusive;  however,  the  authors  stated  higher  expectations  for  the 
NNs  as  data  become  more  complex. 

Advancements  in  camera-based  interpretation  of  human  movements  have  evolved 
to  enable  recognition  and  tracking  (Gavrilla  1999).  Kania  and  Del  Rose  (2005) 
describe  successful  application  of  camera-based  techniques  to  detect  pedestrian 
movements  and  thus  augment  robot  leader-follower  performance. 

Camera-based  recognition  concepts  have  been  demonstrated  for  leader-follower 
tasks,  and  other  simple  gestural  commands,  as  shown  in  Fig.  1.  There  is  also  the 
potential  use  of  more  advanced  applications  of  camera-based  recognition,  as  it  more 
closely  approximates  actual  computer  vision.  For  example,  given  surveillance  as  a 
mission  goal,  a  camera-based  computer  vision  system  can  serve  to  interpret 
surroundings  in  the  environment,  as  well  as  the  operator  gestures  (e.g.,  threat 
assessment  based  on  motion,  etc.)  (Cohen  2005,  2013). 


Fig.  1  Camera-based  control  of  robotic  mule  (Taylor  et  al.  2012) 

2.2.3  Aircraft  Direction 

There  are  ongoing  efforts  to  build  gesture  recognition  for  aircraft  and  unmanned 
aerial  vehicle  (UAV)  handling,  which  have  similar  task  demands  to  ground  robot 
control.  Recent  efforts  to  develop  camera  vision-based  recognition  of  aircraft 
handling hand and arm signals have included work by Choi et al. (2008), as well as
a current Office of Naval Research-funded research and development effort being
conducted  at  the  Massachusetts  Institute  of  Technology  by  Song  et  al.  (2011a, 
2011b,  2012).  Choi  et  al.  report  overall  accuracy  of  99%,  but  for  a  training  set 
consisting  of  multiple  repetitions  of  only  8  gestures,  while  Song  et  al.  have 
demonstrated  gesture  recognition  accuracy  of  75.37%  for  a  subset  of  24  aircraft 
handling  gestures.  In  both  cases,  data  were  collected  within  a  highly  controlled 
laboratory  setting  in  which  lighting  was  controlled,  visual  noise  was  eliminated,  an 
optimized  field  of  view  (FOV)  and  distance  to  the  user  were  ensured,  and  hand  and 
arm signals were generated by individuals who were standing still, rather than
interacting  within  a  complex  and  dynamic  environment.  Thus,  it  is  unclear  how 
well  these  systems  would  operate  in  challenging  circumstances  to  meet  operational 
needs. 

Ablavsky  (2004)  also  reported  progress  toward  proof  of  concept  with  regard  to  the 
use  of  a  camera-based  gesture  recognition  system  for  the  direction  of  UAV 
movements  on  aircraft  carrier  decks.  The  passive  camera  system  used  a  wide  FOV 
for recognition of blinking beacons and a narrow FOV for observing hand and arm
gestures. In contrast, Urban et al. (2004) used a motion tracker along with wearable
sensors toward the same task goals. Sensors were attached to armbands worn by
the operator. Two sensors on each arm (i.e., upper and lower arm) were determined
to  be  sufficient  for  most  gestures  in  the  Navy  gesture  lexicon  for  control  of  aircraft 
on  the  ground.  While  gestures  were  accurately  recognized,  there  were  problems 
associated  with  wired  sensors  (e.g.,  tangled  wires,  necessity  of  being  near  the  base 
device)  and  fatigue  associated  with  bulky  sensors,  indicating  the  need  for  smaller, 
wireless  sensors. 

2.2.4 Camera-Based Gestural Systems: Constraints in Military Operations

Several factors must be considered before a technology is adopted for use
in military operations, which are more challenging than simple tasks in
controlled settings. Camera-based systems
that have been shown to be successful in controlled laboratory settings may not
generalize to military operations. The following bullets discuss how camera-based
gesture recognition systems might fare under some military operational
circumstances.

•  Line  of  sight.  A  primary  constraint  is  line  of  sight:  the  camera  and  operator 
must  be  within  sight  of  each  other. 

• Visual clutter in operational environments. Clutter poses increased
challenges due to distance and angle from the camera, varied contrast
backgrounds, and visual degradation from smoke, exhaust, haze, and
inclement weather.


• Visual clutter due to additional personnel. Camera-based systems may have
difficulty  recognizing  the  operator  if  other  personnel  (e.g.,  squad  members) 
are  within  the  visual  field. 

•  Visibility.  Camera-based  recognition  is  degraded  or  ineffective  in  poor 
visibility  due  to  smoke,  fog,  and  night  operations.  This  can  be  ameliorated 
with infrared or thermal cameras.

• Complex fast movements. Recognition is hampered by high variation of body
movements, the self-occlusion of one body part by another, and fast movements
of arms and legs (Checka and Demirdjian 2010; Yanik et al. 2012).

•  Latency.  Latency  can  be  a  particular  problem  when  operational  tempo  is 
fast  and  movements  are  quick. 

• Generalizability. Generalizability of recognition to a wider range of users
remains uncertain, as the performance metrics are typically based on a training
set developed from a small number of participants in ideal circumstances
(Garg et al. 2009).

•  Complex  environments.  The  hand  gesture  detection  methods  that  are  based 
on  skin  color,  2-D,  or  3-D  template  matching  are  not  sufficiently  robust  due 
to  the  many  degrees  of  freedom  with  regard  to  hand  movements,  unobvious 
exterior  features  of  hands,  illumination  changes,  and  varied  colored  and 
cluttered  backgrounds  (Xu  et  al.  2012). 

•  Multiple  cameras  are  not  easily  used  on  the  go. 

Table 1 lists these constraints, along with impact on operations and ways in which
the constraint can be managed.

Table 1  Typical constraints in military operations associated with camera-based gestural systems

Issue                   Operational impact                   Potential solutions
----------------------  -----------------------------------  ---------------------------------------
Line of sight           Must use within line of sight        Future: multiple cameras/perspectives
                        (e.g., leader-follower scenario)
Visual clutter          Use in noncluttered environments     Augment with GPS, other wearable
                        (e.g., roads, simple terrain)        markers
Visibility              Limit to high-visibility             Augment with infrared/thermal cameras
                        operations/daytime use
Complex/fast movements  Limit to few simple commands/keep    Augment with GPS/wearable markers
                        operator separate from others
Latency                 Do not use in high-tempo operations  Technology improvements
Generalizability to     Train camera/operator as a system    Algorithm improvements/wearable
other operators                                              markers
Multiple cameras        Static situations (e.g., indoors)    Technology improvements for integration
                                                             of multiple cameras on the go
Pointing gestures       Pointing not well accomplished       Augment with speech recognition;
                                                             Improved technology

2.2.5  Approaches  to  Enhance  Camera-Based  Recognition 

A  strong  advantage  of  camera-based  systems  in  military  operations  is  the  direct  link 
between  the  camera  and  the  operator.  No  wireless  transmissions  are  needed  to 
achieve  communications,  thus  avoiding  issues  regarding  signal  strength  or 
jamming.  This  issue  can  be  the  deciding  factor  for  some  military  situations. 
Potential  advantages  also  arise  when  camera-based  vision  systems  serve  multiple 
purposes,  based  on  general  recognition,  not  only  of  gestures,  but  of  objects 
(stationary  and  moving)  and  potential  threats. 

However,  camera-based  system  effectiveness  can  be  degraded  in  cluttered 
environments  or  if  the  operator  is  out  of  the  line  of  sight.  There  are  a  number  of 
approaches  used  to  enhance  camera-based  recognition.  As  discussed  in  the 
following  paragraphs,  these  approaches  include  making  the  gesture  more  “visible” 
(e.g.,  wearable  markers),  controlling  the  environment,  or  enhancing  the  camera 
technology. 

2.2.5.1 Wearable Markers

A  major  challenge  in  more  complex  settings  is  the  capability  of  recognizing  a 
discrete  gesture  from  a  continuous  stream  of  motion  against  a  cluttered  background. 
Some success has been reported using real-time continuous data (Alon et al. 2005).
Some progress has been made with regard to vision-based gestures in loud and
cluttered settings, where user-borne devices such as microphones or gloves were
not an option (Barattini et al. 2012).

The  efficacy  of  camera-based  gesture  recognition  can  be  aided  by  use  of  markers 
(e.g.,  special  clothing,  colors,  focus  on  skin  tones,  etc.)  to  simplify  and  speed  the 
recognition  process.  Waldherr  et  al.  (2000)  used  face  color  and  shirt  color  as 
features  to  track  movements.  Manigandan  and  Jackin  (2010)  used  skin  color  and 
textures  to  simplify  recognition  of  hand  postures.  Singh  et  al.  (2005)  had  the  user 
wear  clothes  with  colors  that  stand  out  from  the  background,  with  recognition  based 
on motion and color cues and accuracy up to 90% for 11 commands. Malima et al.
(2006)  reported  fast  and  efficient  recognition  of  “counting”  gestures  through 
segmentation  of  the  hand  based  on  skin  color  and  size.  Stancil  et  al.  (2012) 
described  development  of  a  robotic  mine  dog  (Neya  Systems  LLC)  that  located  its 
operator  through  recognition  of  body  shape,  posture,  and  a  jacket  worn  by  the 
operator. 
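
Color-based segmentation of the kind mentioned above can be sketched with a fixed-threshold RGB rule followed by a size check. The rule below is a simple explicit threshold of the type often used for skin detection in daylight; the thresholds, the function names `is_skin` and `skin_region`, and the tiny test image are illustrative assumptions and are not the methods or values used in the cited studies.

```python
def is_skin(rgb):
    """Fixed-threshold RGB test for skin-like pixels (illustrative values)."""
    r, g, b = rgb
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_region(image, min_pixels):
    """Bounding box of skin-like pixels in an image.

    image: 2-D list of (r, g, b) tuples.
    min_pixels: regions with fewer skin pixels are rejected as noise.
    Returns (top, left, bottom, right) or None.
    """
    hits = [(y, x)
            for y, row in enumerate(image)
            for x, pixel in enumerate(row)
            if is_skin(pixel)]
    if len(hits) < min_pixels:
        return None
    ys = [y for y, _ in hits]
    xs = [x for _, x in hits]
    return (min(ys), min(xs), max(ys), max(xs))


# Hypothetical 3 x 4 image: a skin-toned patch on a green background.
SKIN = (220, 170, 140)
BACKGROUND = (40, 90, 40)
image = [
    [BACKGROUND, BACKGROUND, BACKGROUND, BACKGROUND],
    [BACKGROUND, SKIN, SKIN, BACKGROUND],
    [BACKGROUND, SKIN, SKIN, BACKGROUND],
]
print(skin_region(image, min_pixels=3))
```

Fixed color thresholds of this kind are sensitive to illumination and to skin-colored backgrounds, which is consistent with the robustness concerns raised elsewhere in this section.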

2.2.5.2 Controlled Environment

Demonstrations  of  camera-based  recognition  often  occur  in  controlled  laboratory 
settings.  Recognition  can  be  aided  through  minimization  of  movement  of  all  other 
objects  in  the  field  (Singh  et  al.  2005). 

2.2.5.3 Adding Speech Recognition

Speech  recognition  has  been  used  to  facilitate  gesture  systems  of  all  types.  Several 
Department  of  Defense  efforts  are  focused  on  integration  of  gestures  with  speech 
for  robot  control.  While  there  are  advantages  to  the  independent  use  of  gesture- 
based  control,  there  are  also  advantages  that  have  been  demonstrated  when  gestures 
are  integrated  with  speech  to  achieve  communication  that  is  both  naturalistic  and 
precise.  Speech  is  a  very  natural  means  of  communication  and  is  often  the  preferred 
modality  of  control  (Beer  et  al.  2012).  Progress  toward  natural  speech  control  has 
been  demonstrated  for  situations  involving  autonomous  ground  robots 
(Schermerhorn 2011; Hill et al. 2015).

Jones  (2007)  combined  camera-based  recognition  based  on  analysis  of  3-D  point 
clouds with speech recognition. Robot capabilities included person-following,
gesture recognition based on silhouette and 3-D head position (“wait-follow”,
“breach doorway”), and speech-based controls (“turn back”, “follow
little/big”).  These  capabilities  were  demonstrated  in  moderately  noisy 
environments  where  both  the  robot  and  operator  were  moving.  Speech  commands 
enabled  control  when  the  robot  and  operator  were  out  of  line  of  sight. 

2.2.5.4 Adding a Handheld Pointing Device

Natural  speech,  used  alone,  can  be  ambiguous  (e.g.,  “go  over  there”)  and  needs 
some  means  of  conveying  details  such  as  direction,  either  through  further  speech- 
based  clarification,  a  PDA  visual  display-based  map,  a  laser  pointer,  or  a  gesture, 
to  address  the  ambiguity  in  deictic  elements  (e.g.,  “this”,  “that”,  “there”,  etc.) 
(Perzanowski  et  al.  1998,  2000a,  2000b). 

3.  Wearable  Instrumented  Systems 

An  alternative  to  camera-based  systems  for  gesture-based  controls  is  that  of 
wearable  instrumented  systems.  The  most  common  technology  used  for  gestures  is 
instrumented  gloves,  but  a  few  other  approaches  have  been  identified,  as  described 
in  the  following  subsections.  In  this  section,  we  describe  the  2  main  approaches  to 
wearable  gestural  systems  (e.g.,  instrumented  gloves,  electromyographic  [EMG] 
sleeves)  and  approaches  to  gesture  recognition  using  these  systems.  We  then 
describe  some  military  applications  using  these  systems.  While  the  focus  is  on  robot 
control,  we  include  some  other  applications,  such  as  aircraft  control  and  dismount 
communications.  There  are  also  applications  developing  within  cockpit-type 
environments  (Brown  et  al.  2011;  DeVries  et  al.  2012;  Slade  and  Bowman  2011); 
however,  details  are  limited. 

3.1  Wearable  Instrumented  Gloves 

Instrumented  gloves  are  the  most  common  instantiation  of  wearable  instrumented 
systems for robot control. The glove concept is congruent with many work situations
where  operators  may  already  have  to  wear  gloves.  Early  versions  of  these  gloves 
were  integrated  for  computer  usage,  in  that  the  gloves  could  be  used  for  computer 
interface  actions  such  as  menu  selection.  However,  the  reliance  on  a  visual  display 
was  somewhat  detrimental  to  performance  (Kenn  et  al.  2007).  For  robot  control, 
glove-based  approaches  are  usually  stand-alone,  with  the  glove  sending  signals  to 
robotic  intelligence  software  for  recognition,  interpretation,  and  translation  into 
computationally  understandable  and  executable  robotic  behaviors. 

Earlier  instrumented  gloves  relied  on  sensors  such  as  bend  sensors,  which  react  to 
changes  of  finger  angles,  and  sensor  electrodes  (Karlsson  et  al.  1998).  Others  use 
touch  sensors,  magnetic  trackers,  embedded  accelerometers,  and  electromagnetic 
position  sensors  with  multiple  degrees  of  freedom  to  convey  information  that  is 
then  mathematically  interpreted.  Optical  fiber  sensors  have  also  been  used  to  detect 
angular  displacements  of  finger  joints  (Fujiwara  et  al.  2013).  Iba  et  al.  (1999) 
described  the  integration  of  a  Cyberglove,  a  6DOF  position  sensor  to  determine 
wrist  position  and  orientation,  a  mobile  robot,  and  a  geolocation  system  that  tracks 
the robot location. The operator may give specific commands to the robot or wave
in  the  direction  the  robot  is  to  move  (e.g.,  to  the  left  or  right).  HMM  algorithms  are 
used  for  gesture  recognition.  Multiple  users  were  used  to  train  the  6  commands. 
Recognition  under  ideal  conditions  was  very  accurate  (96%-100%),  with  fewer 
false  positives  associated  when  there  was  a  “wait”  state  in  the  gesture,  such  that  the 
gestures  were  more  easily  distinguished. 

Barber et al. (2013) developed an instrumented glove combined with a 9-axis
inertial measurement unit (IMU) for classification of 21 unique hand and arm
gestures that were drawn from the Army Field Manual (Headquarters, Department of the
Army 1987) and modified Palm Graffiti. The wireless gesture-recognition glove
incorporated bend sensors to distinguish pointing gestures and open/closed fist for
determining the start/end of a gesture. They reported 98% accuracy using a
modified handwriting recognition statistical algorithm. The same algorithm was
tested against a handheld device (Nintendo Wiimote), which also demonstrated an
accuracy of 96% on the same set of gestures. Hill et al. (2015) demonstrated the
use of this same system in an outdoor field assessment where arm and hand gestures
were used to pause, resume, and directly control (e.g., move forward) a mobile robot.
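
The use of an open/closed fist to mark the start and end of a gesture can be sketched as a simple stream segmenter. The example below is only an illustration under stated assumptions: it assumes that a closed fist marks an active gesture (the source does not say which hand state opens the gesture), and the function name `segment_gestures` and the sample format are hypothetical.

```python
def segment_gestures(samples):
    """Split a stream of (fist_closed, imu_sample) pairs into gesture segments.

    Assumption: a gesture starts when the fist closes and ends when it opens.
    Samples recorded while the fist is open are discarded.
    """
    segments = []
    current = None  # samples of the gesture in progress, or None when idle
    for fist_closed, imu_sample in samples:
        if fist_closed:
            if current is None:
                current = []  # fist just closed: start a new segment
            current.append(imu_sample)
        elif current is not None:
            segments.append(current)  # fist just opened: close the segment
            current = None
    if current is not None:
        segments.append(current)  # stream ended while a gesture was in progress
    return segments

# Integers stand in for IMU readings in this toy stream.
stream = [(False, 0), (True, 1), (True, 2), (False, 3), (True, 4), (False, 5)]
segments = segment_gestures(stream)
```

Each returned segment would then be passed to the recognition algorithm. An explicit start/end signal of this kind avoids having to spot gestures inside a continuous stream of motion.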

While  instrumented  gloves  in  general  have  core  traits  in  common,  each  approach 
to  design  has  advantages  and  disadvantages  specific  to  particular  task  demands.  For 
example, bend sensors can be fragile in adverse environments or under repeated use, and
can  only  track  static  postures.  Finger-touch  sensors  may  move  out  of  position  over 
time.  Accelerometers  on  fingertips  and/or  the  back  of  the  hand  rely  on  precise 
finger  movements  and  finger  and  hand  position.  Piezo  sensors  have  also  been  used 
(Hu  et  al.  2009).  While  several  types  of  gloves  can  accommodate  static  hand 
postures (positioning), dynamic hand and arm movements are more challenging or
are  not  possible  for  effective  gesture  recognition  by  some  of  these  instrumented 
systems.  In  short,  the  findings  from  one  type  of  glove  do  not  necessarily  generalize 
to  other  types  of  gloves;  they  are  not  equal  in  capability  or  usability. 

3.2  Other  Wearable  Sensors 

While  gloves  are  the  more  common  approach  to  wearable  sensors  for  gesture 
recognition,  other  types  of  wearable  sensors  have  also  been  developed.  Wu  et  al. 
(2010)  used  a  3-axis  accelerometer  mounted  to  the  user’s  wrist  to  record  hand 
trajectories,  which  were  then  classified  as  1  of  6  commands  (“turn  right/left”,  “go 
straight”,  “go  back”,  “rotate”,  “stop”).  They  reported  92%  accuracy  using  the  DTW 
recognition  algorithms.  Lementec  and  Bajcsy  (2004)  used  an  array  of  wearable 
sensors to detect arm orientations. Yan et al. (2012) reported success with
body-worn sensors (e.g., arm tape, wrist harness, thoracic and pelvic orientation sensors)
as well as with 3-D (i.e., Kinect) camera-based systems. While the body-worn
and camera-based systems were equally effective, there were additional constraints
with the camera-based system beyond those of the body-worn sensors. For the
camera-based systems, the controller must always face the camera and stay within a
certain distance of the camera.

There  is  an  emerging  alternate  approach  for  gesture  control  that  is  based  on 
recognition  of  EMG  signals,  where  EMG  activity  is  recorded  from  forearm 
locations.  In  the  field  of  prosthetics,  EMG  signals  have  been  used  for  simple  binary 
commands;  however,  advancements  are  enabling  more  complex  coding  based  on 
postures. Crawford et al. (2005) described how they used this approach to control
a robotic arm with gripper functions, from static hand postures. Electrodes were
placed on 7 forearm locations and one on the upper arm. Features were extracted
for  each  person  (N  =  3),  for  each  of  8  hand  postures,  over  5  sessions.  In  each 
session,  the  subject  maintained  each  of  the  8  postures  for  10  s.  Robotic  tasks  ranged 
from  simple  (e.g.,  “move  the  arm  to  topple  blocks”)  to  more  challenging  (e.g.,  “pick 
up  a  designated  object  and  drop  in  specified  bin”).  They  reported  classification 
accuracies  over  90%.  Progress  has  been  reported  with  regard  to  adaptability  to 
power  source  and  reduction  in  muscle  fatigue  (Li  et  al.  2011)  and  more 
generalizable  recognition  of  signals  across  multiple  users  (Matsubara  and 
Morimoto  2013). 
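
To make the idea of classifying static postures from multichannel EMG more concrete, the following sketch extracts one root-mean-square (RMS) amplitude per channel and assigns a window to the nearest class centroid. This is a stand-in technique chosen for brevity, not the feature set or classifier reported by Crawford et al. (2005); the 8-channel layout mirrors the electrode count described above, while the function names, synthetic signals, and posture labels are assumptions.

```python
import numpy as np

def rms_features(window):
    """RMS amplitude per channel for an EMG window shaped (samples, channels)."""
    window = np.asarray(window, dtype=float)
    return np.sqrt(np.mean(window ** 2, axis=0))

def fit_centroids(windows, labels):
    """Average the RMS feature vectors of the training windows per posture."""
    feats = np.array([rms_features(w) for w in windows])
    labels = np.asarray(labels)
    return {str(label): feats[labels == label].mean(axis=0)
            for label in np.unique(labels)}

def classify(window, centroids):
    """Return the posture whose centroid is closest to the window's features."""
    f = rms_features(window)
    return min(centroids, key=lambda label: np.linalg.norm(f - centroids[label]))

# Synthetic data: each posture is simulated as noise with a different
# per-channel gain (hypothetical muscle activation pattern).
rng = np.random.default_rng(0)

def fake_window(gains):
    return rng.normal(size=(200, 8)) * np.asarray(gains)

fist = [1, 1, 1, 1, 5, 5, 5, 5]
open_hand = [5, 5, 5, 5, 1, 1, 1, 1]
train = [fake_window(fist) for _ in range(5)] + [fake_window(open_hand) for _ in range(5)]
labels = ["fist"] * 5 + ["open"] * 5

centroids = fit_centroids(train, labels)
prediction = classify(fake_window(fist), centroids)
```

Because the centroids are fit per user and per session in this sketch, it also illustrates why generalizing across multiple users is treated as a separate research problem.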

The JPL Biosleeve, developed by the Advanced Robotics Controls group at the Jet
Propulsion Laboratory (California Institute of Technology), also uses this approach.
Eight bipolar EMG sensors and an IMU are mounted in a wearable sleeve that
monitors forearm muscles. The Biosleeve development is focused toward National
Aeronautics  and  Space  Administration  space  missions  to  enhance  telepresence 
control  and  for  astronauts  working  side  by  side  with  robots  (Assad  et  al.  2012;  Wolf 
et  al.  2013). 

The  concept  is  particularly  apt  for  astronaut  use  outside  the  vehicle,  as  they  are 
encumbered  with  heavy  gloves,  making  traditional  interfaces,  such  as  joystick  or 
touch  interfaces,  less  effective.  The  sleeve  is  programmed  to  a  particular  user  to 
develop  recognition  for  static  and  dynamic  gestures  using  algorithms  to  match 
temporal  feature  patterns  to  known  templates.  Recognition  rates,  based  on  3 
subjects,  ranged  from  93%  to  99%  over  17  static  gestural  commands.  Recognition 
rates  based  on  one  subject  resulted  in  99%  accuracy  for  9  dynamic  gestures. 

3.3  Gesture  Recognition  by  Instrumented  Systems 

The  glove-based  approaches  have  the  same  core  challenge  of  gesture  capture  and 
coding  as  do  the  camera-based  approaches.  Gesture  recognition  analysis  methods 
for instrumented gloves overlap with approaches taken with camera-based systems:
HMMs (Rabiner and Juang 1986), FSM (Hong et al. 2000), DTW (Hu et al. 2009;
Wu  et  al.  2010),  and  ANNs  (Oz  and  Leu  2007).  Hand  and  body  gestures  can  be 
transmitted  from  a  controller  mechanism  that  contains  IMU  sensors  to  sense 
rotation  and  acceleration  of  movement.  HMMs  have  the  ability  to  model  sequential 
information  and  have  been  used  dominantly  throughout  the  past  decade  (Ong  and 
Ranganath 2005). In previous years, HMMs were used to recognize ASL with real-time
color-based hand tracking (Starner and Pentland 1996). However, HMMs require
a relatively long training time and a large amount of training data. Others have
found  that  the  DTW  approach  is  as  effective  as  HMM,  with  a  small  set  of  gestures, 
while  taking  less  time  to  prepare  (Wu  et  al.  2010).  Huang  and  associates  (2011) 
describe  a  somewhat  different  approach  to  glove-based  recognition,  which  abstracts 
a concept from raw data to recognize clusters of data. While the mathematical
algorithms  differ,  there  are  still  the  same  basic  principles  that  underlie  algorithmic 
recognition  strategies  to  parse  gestural  movements  into  meaningful  commands. 
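
As an illustration of the DTW approach mentioned above, the sketch below computes the classic dynamic time warping distance between a recorded trajectory and a small set of stored templates, then picks the closest template. It is a textbook formulation under simplifying assumptions (one template per command, Euclidean local cost, no warping window) and is not the specific implementation of Wu et al. (2010) or Hu et al. (2009); the template values are invented 1-D stand-ins for accelerometer traces.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # a advances, b repeats
                                 cost[i, j - 1],      # b advances, a repeats
                                 cost[i - 1, j - 1])  # both advance
    return float(cost[n, m])

def classify(trajectory, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(trajectory, templates[name]))

# Hypothetical templates for two of the commands.
templates = {
    "turn right": [0, 1, 2, 3, 2, 1, 0],
    "turn left": [0, -1, -2, -3, -2, -1, 0],
}

# Same shape as "turn right" but performed at half speed.
slow_right = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
label = classify(slow_right, templates)
```

The slow gesture aligns with its template at zero cost because DTW absorbs differences in execution speed. Only a few stored templates are needed, which is consistent with the shorter preparation time noted for small gesture sets.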

Earlier  versions  of  wearable  instrumented  gloves  relied  on  recognition  of  static 
hand  gestures,  as  extrapolated  by  the  glove  sensors  with  regard  to  hand  posture. 
Gesture  recognition  was  based  on  specific  features  of  a  hand  posture  that  may  be 
somewhat  artificial  and  difficult  to  replicate  across  different  users.  While  intuitive 
in  nature,  and  appreciated  by  users,  the  application  of  glove-based  control  can  be 
challenging  for  robot  control.  Kenn  et  al.  (2007)  found  that  users  would  naturally 
look  to  the  robot  instead  of  their  hands,  thus  allowing  more  attention  to  robot 
performance. However, the gestural commands were sometimes misinterpreted,
and these misinterpretations had more negative consequences for robot control
compared  with  other  purposes,  such  as  using  gestures  to  control  a  presentation 
application.  The  negative  consequences  include  robot  movements  that  keep  going 
if  a  stop  command  is  not  recognized,  or  robots  that  actuate  in  response  to  an 
unintended  command.  With  human-robot  situations,  the  tolerance  for  error  may  be 
very  low. 

Recent  applications  of  the  instrumented  glove  for  robot  control  have  been  reported 
as  successful.  Boonpinon  and  Sudsang  (2008)  used  a  data  glove  to  command 
multiple  robots.  More-recent  technology  adds  the  capability  for  more  dynamic, 
movement-based  gestures.  IMU  sensor  technologies  placed  on  the  body  provide  an 
alternative  approach  to  gesture  recognition  within  uncontrolled  environments.  Iba 
et  al.’s  (1999)  Cyberglove  measures  18  joint  angles  of  the  fingers  and  wrist,  with  a 
6DOF  positioning  system  to  determine  wrist  position  and  orientation.  The  system 
was  used  to  recognize  6  gestures,  based  on  a)  hand  opening,  b)  flat  open  hand,  c) 
hand  closing,  d)  index  finger  pointing,  e)  waving  fingers  to  the  left,  f)  waving 
fingers  to  the  right,  and  g)  none  of  the  above.  Kang  et  al.  (2010)  used  a  version  of 
the Cyberglove having a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis
magnetometer.  They  developed  a  data  fusion  approach  to  integrate  findings, 
consisting  of  5  substeps.  They  reported  an  average  of  94%  recognition  rate  for  10 
gestural  commands.  Rates  were  somewhat  lower  for  12  commands  (91.7%)  and  14 
commands  (86.5%).  Jin  and  his  associates  (2011)  used  an  instrumented  glove 
orientation  sensor  to  recognize  static  hand  commands. 

3.4  Military  Applications:  Instrumented  Systems 


3.4.1  Dismount  Soldier  Communications 

Sullivan  et  al.  (2010)  described  progress  toward  a  DARPA-sponsored  goal  for  a 
Soldier  ensemble  incorporating  gesture  recognition,  head  tracking,  laser 
rangefinding,  and  augmented  reality  displays  to  support  shared  local  situation 
awareness.  Objects  marked  by  one  Soldier  using  gestures  are  portrayed  to  other 
Soldiers  and  shown  as  overlays  in  wearable  visual  displays.  The  gestural  system 
included  hand  and  upper  arm  gesture  sensors  that  work  with  Soldiers  wearing 
gloves. 

Ellen  et  al.  (2010)  offer  a  concept  of  operations  for  use  of  wireless  communication 
gloves  to  communicate,  plan,  and  react  while  wearing  Mission-oriented  Protective 
Posture  (MOPP)  gear  (i.e.,  protective  ensemble  for  chemical  and  biologic  threats) 
(Ceruti  et  al.  2009;  Defense  Tech  Briefs  2010).  While  smartphone  applications  can 
be  useful  for  Soldier  use,  touchscreen  controls  are  not  easily  accomplished  while 
wearing  MOPP  gear  that  includes  heavy  gloves.  The  gloves  have  magnetic  and 
motion  sensors  for  gesture  recognition.  Additional  glove-based  sensors  would 
provide  information  regarding  chemical  and  biological  threats  in  the  immediate 
area.  Other  sensors  in  this  concept  include  GPS,  physiological  sensors,  and  video 
imagery.  Researchers  from  this  group  also  demonstrated  the  wireless  glove  for 
robot  control  (Tran  et  al.  2009).  In  this  effort,  glove-based  sensors  sent  signals  to  a 
processing unit worn on the forearm, which then sent signals to a TALON robot.
Commands  were  used  to  control  robot  position,  camera,  claw,  and  arm,  such  as 
“point  camera”  and  “grab  object”. 

AnthroTronix has demonstrated IMU-based hand and arm signal gesture
recognition accuracy of 100% (Vice et al. 2001) via a custom instrumented glove
interface.  A  more  recent  effort  developed  an  instrumented  glove  for  robot  control 
and  for  hand  and  arm  gestures  for  Soldier  communications.  Robot  control  was 
accomplished  through  navigation  cues  via  static  hand  gestures  (e.g.,  “move 
forward”,  “move  backward”,  “turn  left”,  “turn  right”,  “stop”).  Communication  with 
other  Soldiers  was  sent  through  hand  and  arm  gestures  (i.e.,  freeze,  rally,  danger, 
double  time)  that  were  received  through  signal  patterns  via  a  tactile  vest.  The  tactile 
vest allowed more immediate perception and understanding compared with
standard  visually  communicated  hand  and  arm  signals  (Elliott  et  al.  2014). 

The  US  Army  Research  Laboratory  and  the  University  of  Central  Florida  have 
demonstrated an IMU-based hand and arm signal gesture recognition system for
interactions with an autonomous ground robot (Hill et al. 2015; Barber et al. 2015).
The IMU-based instrumented glove recognized 8 dynamic hand and arm gestures
to  issue  commands  to  the  robot.  Integrated  within  a  multimodal  interface  that  also 
used  speech  for  verbal  commands,  the  user  could  perform  a  gesture  to  pause  or 
resume  autonomous  navigation  of  the  robot  in  addition  to  simple  drive  commands 
(e.g.,  “move  forward”,  “move  backward”,  “turn  left”). 

Calvo et al. (2012) developed and evaluated a pointing device embedded within a
tactical glove normally worn by US Air Force Battlefield Airmen during dismount
military operations. Battlefield Airmen are the special operations force of the Air
Force.  The  system  detects  the  user’s  wrist  and  finger  movements  through  a  3-axis 
gyroscopic  sensor.  This  capability  was  primarily  explored  for  controlling  cursor 
movements  on  a  handheld  device,  as  compared  with  using  a  touchpad  or  TrackPoint 
device.  Moving  the  sensor  with  directional  hand  movements  proportionally  moves 
the  cursor  in  the  same  direction.  The  user  can  also  use  their  thumb  to  press  buttons 
placed  on  the  side  of  the  index  finger,  to  enable  cursor  movement,  and  perform 
cursor left clicks. While training time to reach asymptotic performance was longer
with the glove than with the touchpad or TrackPoint device, throughput
performance was significantly higher with the glove compared with the touchpad,
which, in turn, was significantly higher than the TrackPoint. Movement times were
lowest with the glove; however, the glove was associated with higher error. This
glove concept was compared to a handheld controller for micro-UAV control to
navigate waypoints. Operators included 6 nonmilitary subjects performing in a
simulation-based  training  environment.  Waypoints  were  presented  as  red  cubes  in 
a  3-D  virtual  environment.  Performance  was  not  significantly  different  from  the 
handheld  controller.  These  results  are  quite  promising  and  warrant  further 
investigations  using  military  subjects. 

3.4.2  Robot  Control 

3.4.2.1 AnthroTronix

Figure  2  shows  Soldiers  using  an  instrumented  glove  for  robot  control  and  for  hand- 
arm  communications  (Elliott  et  al.  2014). 

Fig. 2 Soldier using instrumented glove for robot control (left) and communications (right)

3.4.2.2 SA Photonics

Researchers  at  the  Space  and  Naval  Warfare  Systems  Center  (Ceruti  et  al.  2009; 
Tran  et  al.  2009)  demonstrated  application  of  a  wireless  communications  glove  for 
robot  control  and  communications,  along  with  other  tasks,  such  as  motion  tracking, 
gesture recognition, data transmission, and reception, in normal and in extreme
environments.  They  compared  glove  prototypes  with  and  without  a  haptic  (i.e., 
vibrotactile)  capability  for  signal  feedback  and  covert  signal  reception.  Both  types 
of  gloves  can  be  worn  under  a  space  suit  or  chemical  protective  gear  and  still 
operate  effectively. 

3.4.3  Aircraft  Direction 

Urban  et  al.  (2004)  used  a  motion  tracker  along  with  wearable  sensors.  Sensors 
were  attached  to  armbands  worn  by  the  operator.  Two  sensors  on  each  arm  (i.e., 
upper  and  lower  arm)  were  determined  to  be  sufficient  for  most  gestures  in  the 
Navy  gesture  lexicon  for  control  of  aircraft  on  the  ground.  While  gestures  were 
accurately  recognized,  there  were  problems  associated  with  wired  sensors  (e.g., 
tangled  wires,  necessity  of  being  near  the  base  device)  and  fatigue  associated  with 
bulky  sensors,  indicating  the  need  for  smaller,  wireless  sensors. 

Research  and  development  of  gestural  communications  for  military  use  highlight 
positive  potential  for  successful  use  and  the  limitations  that  must  be  addressed.  The 
variation  of  gestural  command  approaches  represents  a  corresponding  variety  of 
advantages  and  disadvantages.  Identification  of  promising  technology  for  a 
particular  use  should  begin  with  consideration  of  purpose  through  analyses  of 
operational  requirements,  task  demands,  and  situational  constraints. 

3.4.4 Constraints in Military Operations

Some  characteristics  constrain  the  use  of  wearable  sensors  such  as  instrumented 
gloves.  As  identified  in  the  following  bullets,  while  some  of  the  wearable 
instrumented systems can perform well in controlled environments, military
operational  environments  may  prove  challenging  for  the  current  state-of-the-art 
technology. 

•  Operational  range.  The  instrumented  gloves/sensors  must  be  used  within 
system  range.  At  this  time,  technology  improvements  to  the  electronic 
transmission  capabilities  (for  sender  and  receiver)  will  be  needed  for  longer 
distances. 

•  Latency.  Latency  of  signals  may  be  a  problem  when  operational  tempo  is 
fast  and  movements  quick. 

•  Multiple  instrumented  gloves/sensors.  There  may  be  networking  limitations 
on  the  integration  of  multiple  wearable  controllers. 

•  Pointing  gestures.  Pointing  has  not  yet  been  well-achieved  while  using 
wearable  instrumented  devices  for  gestures.  Certainly  part  of  the  issue  is 
that  it  is  difficult  to  know  what  a  pointing  gesture  is  directed  toward — what 
is  being  pointed  at?  Current  technology  may  be  able  to  be  augmented  with 
speech  recognition. 
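
The difficulty of resolving what a pointing gesture is directed toward can be illustrated with a simple geometric sketch. Assuming flat ground and a sensor that reports the hand height and the direction of the pointing ray (both assumptions made only for this example, as are the function name and the numeric values), the pointed-at location is where the ray meets the ground. Small angular errors then translate into large position errors at shallow pointing angles.

```python
import math

def pointed_ground_location(hand_height_m, azimuth_deg, depression_deg):
    """Intersect a pointing ray with flat ground.

    Returns (x, y) in meters relative to the operator, or None when the ray is
    level or points upward and therefore never reaches the ground.
    `depression_deg` is the angle of the ray below horizontal.
    """
    if depression_deg <= 0:
        return None
    ground_range = hand_height_m / math.tan(math.radians(depression_deg))
    az = math.radians(azimuth_deg)
    return (ground_range * math.cos(az), ground_range * math.sin(az))

# Hand at 1.5 m, pointing straight ahead and 10 degrees below horizontal.
target = pointed_ground_location(1.5, 0.0, 10.0)

# The same gesture measured with a 2-degree error in the depression angle.
nudged = pointed_ground_location(1.5, 0.0, 8.0)
shift_m = nudged[0] - target[0]  # roughly 2 m of displacement on the ground
```

In this toy case, a 2-degree measurement error moves the indicated spot by roughly 2 m, which suggests why pointing is often paired with speech or object recognition to disambiguate the intended target.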

4.  Discussion 


In  our  overview,  we  focused  on  hand  and  arm  gestures,  as  they  are  most  commonly 
used. For example, Mitra and Acharya (2007) identify 90% of gesture
communication as conveyed through hands and arms. Also, Soldiers are already
familiar with hand and arm signals for communication, some of which could also
be  considered  as  gestural  commands  to  robots.  The  use  of  gestures  is  already 
common  in  military  tactics  and  communication,  as  they  have  the  advantage  of  being 
silent  and  easily  used.  Military  hand  and  arm  signals  are  documented  in  sources 
such  as  the  US  Army  Field  Manual  No.  21-60  (Headquarters,  Department  of  the 
Army 1987) and Marine Rifle Squad MCWP 3-11.2 (Headquarters, Department of
the  Navy  2002).  In  addition,  certain  mission  scenarios  may  have  further 
requirements,  such  as  a  need  for  covert  operations  (e.g.,  low  noise  and  electronic 
transmissions)  or  combat  operations  that  are  characterized  by  high  stress,  high  time 
pressure,  high  noise,  low  visibility,  or  night  operations.  Task  demands  of  military 
ground  robots  will  vary  depending  on  mission  requirements,  such  that  the  user  may 
have  to  control  robot  movements,  camera  actions  (e.g.,  pan,  zoom,  take  pictures), 
or  robot  manipulations  (e.g.,  grasping  and  manipulating  objects). 

The ease of use expected from gestures illustrates a core goal for effective use of
gestural  commands  across  many  situations:  that  they  be  intuitive  and  easy  to  learn; 
ideally  to  the  point  of  being  discoverable  (i.e.,  the  user  discovers  the  gestural 
command  as  a  natural  consequence  of  interaction).  The  best,  most  natural  designs 
are  those  that  are  intuitive,  that  match  the  behavior  of  the  system  to  the  gestures 
humans  would  naturally  use  to  accomplish  the  behavior,  such  as  putting  your  hands 
under  a  faucet  to  turn  it  on,  walking  through  a  doorway,  or  pointing  to  direct 
attention  to  a  location.  Task  context  must  also  be  a  primary  consideration  in  gesture 
evaluation:  some  gestures  may  be  intuitive,  easily  learned,  and  performed  for  a 
single  command  but  would  be  inappropriate  for  sustained  repetitive  use.  Some 
situations  will  be  more  demanding  on  the  user  and  the  technology,  such  that  each 
technology  must  be  assessed  in  light  of  the  intended  use  and  users.  However, 
general  principles  for  gestural  commands  would  include  the  following  design  goals 
to  achieve  high  user  acceptance,  arising  from  perceived  usefulness  along  with  ease 
of  use  (Davis  1989): 

•  Easy to learn

•  Easy to perform

•  “Natural”  movements 

•  Easy  to  remember 

•  Easily  distinguishable  gestures  (from  other  gestures) 

•  Easily  distinguished  from  normal  work  movements 

•  Easily  distinguished  from  other  people  in  surrounding  area 

•  Easily  distinguished  from  normal  social  gesticulations 

•  Easily  recognized  within  a  specified  range  (relevant  to  task) 

•  Socially  acceptable 

•  On/off  control 

•  Compatible  with  and  supplemental  to  existing  hand  and  arm  gestures 

While  these  characteristics  seem  self-evident,  embedded  within  are  issues  with 
regard  to  measurement  of  these  concepts.  For  example,  concepts  such  as 
“intuitiveness”  and  “comfort”  are  subjective  and  perhaps  somewhat  ambiguous, 
although  efforts  have  been  made  toward  a  more-quantitative  approach  to 
measurement  (Stern  et  al.  2006).  In  addition,  there  are  situational  moderators,  such 
that  one  solution  does  not  fit  all.  For  example,  ease  of  learning  and  recognition  can 
be  significantly  affected  not  only  by  the  nature  of  the  gesture(s)  themselves,  but 
also by the number of gestures that are needed. Certainly, it is more difficult to
learn, remember, and distinguish a large number of gestures. Comfort and ease of
performance will also be greatly affected by gesture duration and use over time,
and  whether  commands  must  be  made  while  on  the  move.  Social  acceptance  will 
also  be  affected  by  the  environment;  for  example,  speech-based  controls  would  not 
be  as  appropriate  for  use  in  a  quiet  situation  such  as  a  library.  Similarly,  broad  active 
gesticulations  would  not  be  appropriate  for  covert  Soldier  missions,  such  as  an 
ambush.  Designers  must  systematically  analyze  situation  task  demands  to  best 
tailor  gestures  (e.g.,  static  hand  postures,  dynamic  hand  movements,  etc.)  with 
gestural  recognition  capabilities,  environmental  constraints,  and  need  for  precision 
(e.g.,  consequences  of  error  or  repetition).  Given  this,  we  turn  attention  to  issues 
relevant  to  Soldier  use  and  performance. 

4.1  Issues  Relevant  to  Soldier  Use 

It  is  evident  that  the  field  of  gestural  controls  for  robots  has  shown  promise  along 
a  number  of  task  demands,  ranging  from  simple  commands  to  more  complex  and 
autonomous  tactical  commands.  Given  that  Soldier  tasks  can  also  vary,  along  with 
environmental  demands  and  constraints,  we  identify  the  following  mission  and 
task-related  factors  for  consideration  when  designing  gestural  controls  for  Soldier 
use. 

4.1.1  Line  of  Sight/Distance  between  Robot  and  User 

Many  dismount  Soldier  missions  are  conducted  with  intact  squads,  where  Soldiers 
are  within  line  of  sight.  Some  robot  roles  would  normally  take  place  in  proximity 
to  squad  members,  such  as  robotic  mules  to  offload  weight.  While  the  constraints 
associated with the need for line of sight seem more applicable to camera-based
systems,  it  can  also  be  relevant  to  instrumented  systems,  because  of  the  need  for 
appropriate  range  for  wireless  signals.  The  need  for  line  of  sight  or  close  range  can 
affect  the  nature  of  gestures  to  be  used.  For  example,  larger,  more  whole-body 
dynamic  movements  may  be  more  effective  for  camera-based  systems.  The  need 
for  direct  line  of  sight  with  camera  systems  could  affect  military  operations  such  as 
room  clearing,  reconnaissance,  ordnance  removal,  and  other  situations  where  the 
robot  may  be  distant  and/or  out  of  line  of  sight.  Perhaps  most  important,  the  need 
for  line-of-sight  operations  requires  the  Soldier-gesturer  to  be  exposed,  making  the 
gesturer  more  vulnerable  and  potentially  reducing  survivability. 

4.1.2  Visibility/Night  Operations 

For camera-based systems, the issues associated with the need for visibility are
similar to those of line of sight and range. Camera-based systems that cannot be
effective in smoke, fog, mist, or night operations will be greatly limited in tactical
applications. To be effective in these low-visibility conditions, camera-based
systems will need to be augmented with specialized capabilities such as night- or
thermal-vision capabilities. (Research and progress in this area warrant greater
focus with regard to engineering and human factors.) At this time, wearable
instrumented  systems  in  general  have  greater  advantage  for  limited  visibility 
operations. 

4.1.3  Practicality  of  Using  Special  "Markers" 

Some  camera  image  recognition  systems  are  augmented  by  the  use  of  special 
clothing  or  wearable  markers,  which  ease  the  processing  complexity  and  load. 
Situational  task  demands  will  greatly  affect  whether  this  approach  can  be  used.  It 
should  be  noted  that  the  special  clothing  or  markers  that  make  the  user  more  visible 
and  recognizable  to  the  camera  would  not  likely  be  suited  for  covert  operations 
where  Soldiers  strive  for  discretion  and  secrecy.  Special  clothing/markers  can  make 
the  Soldier  a  higher-risk  target  and  more  likely  to  be  engaged  by  enemy  forces;  thus,
survivability  may  be  compromised.

4.1.4  Practicality  of  Using  Wireless  Transmissions 

Some  operational  missions  may  require  silence,  not  only  for  audio,  but  also  for 
electronic  transmissions.  Other  missions  may  be  more  susceptible  to  jamming  of 
such  electronic  transmissions.  While  camera  systems  are  not  vulnerable  to  jamming 
transmissions,  the  instrumented  wearable  systems  would  be  rendered  ineffective  if 
transmissions  were  jammed. 

4.1.5  Need  for  Deictic  Information:  Pointing  Gestures 

Deictic  information  depends  on  knowledge  of  situation  context  and  personnel
locations.  The  receiver  robot  must  know  where  the  operator  is,  as  well  as  the 
direction  of  the  pointing  gesture.  While  pointing  gestures  are  intuitive  and  perhaps 
the  most  useful  of  gestures,  they  are  also  quite  difficult  to  instantiate  successfully, 
for  both  camera-based  and  instrumented  systems.  If  there  is  a  need  for  deictic 
information,  specifications  should  be  generated  with  regard  to  the  nature  of  the 
pointing  gesture.  For  example,  will  the  gesture  indicate  a  general  area?  If  so,  how 
large  is  this  area?  Will  the  gesture  be  augmented  by  speech  and  object  recognition 
(e.g.,  “bring  me  that  ammo  pack”)  or  integrated  with  map-based  information  on  a
visual  display?  The  multimodality  of  the  entire  system  will  greatly  affect  the  utility 
of  pointing  gestures  in  general. 
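The questions above about how large an area a pointing gesture indicates can be made concrete with a little geometry. The following is a minimal sketch, not a method taken from any of the systems reviewed: it assumes the pointing direction is the ray from the operator's shoulder through the hand, that the ground is a flat horizontal plane, and that the tracking system has some known angular error. The function names and coordinate convention (meters, z up) are hypothetical.

```python
import math

def pointing_target(shoulder, hand, ground_z=0.0):
    """Return the (x, y) point where the shoulder-to-hand ray meets the
    horizontal plane z = ground_z, or None if the ray never reaches it."""
    sx, sy, sz = shoulder
    hx, hy, hz = hand
    dx, dy, dz = hx - sx, hy - sy, hz - sz
    if dz == 0:
        return None  # ray is parallel to the ground plane
    t = (ground_z - sz) / dz
    if t <= 0:
        return None  # operator is pointing away from the plane
    return (sx + t * dx, sy + t * dy)

def uncertainty_radius(distance_m, angular_error_deg):
    """Approximate radius of the indicated area at a given distance,
    for a given angular error in the estimated pointing direction."""
    return distance_m * math.tan(math.radians(angular_error_deg))
```

Under these assumptions, a 5-degree error in the estimated arm direction corresponds to an indicated area of a little under 1 m radius at 10 m, which suggests why distant pointing targets usually need speech, object recognition, or a map display to disambiguate.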

The  effectiveness  of  any  gesture  set  will  be  affected  by  the  nature  of  its  multimodal 
context.  Gestures  may  be  added  to  augment  an  existing  system  to  increase
intelligibility  in  noisy  environments.  In  the  same  way,  speech  can  augment  a
vision-based  gestural  system  when  visibility  is  occluded.  It  is  not  likely  at  this  time  that  a
gestural  command  system  would  be  completely  stand-alone;  instead,  it  will  serve
to  augment  and  complement  speech,  keyboard,  and/or  visual  map  displays.
Scenario-based  cognitive  task  analysis  approaches  including  these  issues  should 
drive  development  of  new  Soldier  concepts  (Hoffman  and  Elliott  2010). 

4.1.6  Ease  of  Producing  and  Sustaining  Gestures 

Situations  should  be  analyzed  with  regard  to  the  frequency  and  duration  of  gestural 
use.  Gestures  that  may  be  easy  to  produce  for  a  short  period  may  become  fatiguing 
over  time.  For  example,  robot  controllers  found  that  holding  their  hand  steady  and
parallel  to  the  ground  was  at  first  quite  easy  to  accomplish,  but  they  became  fatigued
after  only  15  min.  Thus,  a  relevant  issue  is  whether  the  gestures  have  to  be
maintained  continuously.  This  issue  pertains  to  both  camera-based  and 
instrumented  gestural  systems. 

4.1.7  Number  of  Gestures 

The  number  of  gestural  commands  that  are  needed  for  robot  control  will  greatly 
affect  the  ease  of  use  for  the  operator.  Given  a  situation  that  requires  more  than 
5-7  commands,  particular  attention  must  be  given  to  gesture  distinguishability,  as 
well  as  ease  of  producing  the  gesture.  As  the  number  of  gestural  commands 
becomes  higher,  there  should  be  greater  consideration  of  gestures  that  use 
movement  as  well  as  static  postures,  including  arm  and/or  whole  body  movements, 
to  ease  recognition.  As  the  number  of  gestures  increases,  the  training  takes  longer, 
the  difference  between  gestures  may  be  less  than  optimal,  and  the  robot  may  not  be 
able  to  determine  differences  at  increased  operating  distances.  This  issue  pertains
to  both  camera-based  and  instrumented  systems. 
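One way to check distinguishability as a gesture set grows is to compare every pair of gestures and flag those that lie too close together. The sketch below is illustrative only: it assumes each gesture has already been reduced to a fixed-length feature vector (for example, normalized joint angles), and both the feature values and the separation threshold are hypothetical.

```python
import math
from itertools import combinations

def confusable_pairs(templates, min_separation):
    """Return (name_a, name_b, distance) for every pair of gesture
    templates whose Euclidean distance is below min_separation."""
    pairs = []
    for (a, va), (b, vb) in combinations(sorted(templates.items()), 2):
        d = math.dist(va, vb)  # Euclidean distance between feature vectors
        if d < min_separation:
            pairs.append((a, b, d))
    return pairs

# Hypothetical 2-feature templates for three commands.
templates = {"halt": (0.0, 1.0), "rally": (0.0, 0.9), "advance": (1.0, 0.0)}
flagged = confusable_pairs(templates, min_separation=0.5)
```

Here only the "halt" and "rally" templates are flagged. Because the number of pairs grows quadratically with the number of gestures, a check of this kind becomes more useful as the command set passes the 5-7 range discussed above.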

4.1.8  Number  of  Robots  to  Be  Controlled  by  One  Person 

Gestures  have  been  used  to  coordinate  formation  and  movement  of  multiple  robots.
For  example,  an  instrumented  glove  has  been  used  to  facilitate  coordination  of  robot 
behaviors  in  multirobot  systems  having  inherent  decision  rules  to  stay  together, 
according  to  assigned  distance  parameters  (Boonpinon  and  Sudsang  2008).  In  such 
a  situation,  the  approach  to  gesture  control  is  somewhat  similar  to  the  gestures  and 
whistles  used  by  herders  to  command  sheep  and/or  herding  dogs  (e.g.,  the  sheep 
have  natural  tendencies  for  formation  and  reaction  to  command  stimuli  [Phillips  et 
al.  2012]).  Multirobot  situations  must  be  carefully  considered  to  match  gestural 
commands  to  robot  capabilities  (e.g.,  level  of  autonomy  and  group  formation
algorithms)  and  robot  task  demands,  at  both  the  individual  and  multiple-robot  level
of  analysis.  This  issue  pertains  to  both  camera-based  and  instrumented  systems. 
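The idea of a gesture adjusting a distance parameter while the robots apply their own rule to stay together can be illustrated with a very small example. This is a hedged sketch and not the algorithm of Boonpinon and Sudsang (2008): the proportional rule, the gain, and the gesture-to-spacing table are all assumptions made for illustration.

```python
import math

# Hypothetical mapping from a recognized gesture to a desired spacing (m).
GESTURE_SPACING_M = {"tighten": 2.0, "spread": 6.0}

def spacing_step(me, neighbor, desired, gain=0.5):
    """Move 'me' along the line toward 'neighbor' so that the gap between
    them approaches 'desired'. Returns the new (x, y) position."""
    dx, dy = neighbor[0] - me[0], neighbor[1] - me[1]
    gap = math.hypot(dx, dy)
    if gap == 0:
        return me  # direction undefined; stay put
    move = gain * (gap - desired)  # positive: close in; negative: back off
    return (me[0] + move * dx / gap, me[1] + move * dy / gap)

# A "spread" gesture sets the spacing; the robot then adjusts on its own.
new_pos = spacing_step((0.0, 0.0), (10.0, 0.0), GESTURE_SPACING_M["spread"])
```

The point of the design is that the operator issues one group-level command and each robot resolves its own movement, which keeps the gesture vocabulary small even as the number of robots grows.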


4.1.9  Number  of  Coordinating  Soldier  Users 

As  robot-Soldier  scenarios  become  more  complex,  there  may  be  situations  when 
the  robot  is  considered  a  squad-level  asset,  much  like  any  other  squad  member.  In 
that  case,  careful  attention  must  be  given  to  provide  not  only  the  capabilities  but 
the  policies  for  squad-level  use  (i.e.,  the  tactics,  techniques,  and  procedures  [TTPs] 
to  be  used  by  the  Soldiers).  When  such  TTPs  are  developed,  the  robot  could,  and
should,  participate  as  an  integral  member  of  the  squad  in  training  exercises  (i.e.,
battle  drills)  to  best  ensure  performance  that  is  intuitive,  efficient,  and  effective.  In
addition  to  consideration  of  TTPs,  systems  should  be  programmable  for  accepting 
unit  standard  operating  procedures,  which  differ  from  unit  to  unit.  This  issue 
pertains  to  both  camera-based  and  instrumented  systems. 

4.1.10  Combat  Readiness 

There  are  also  the  usual  factors  that  must  be  considered  for  any  Soldier  system. 
These  would  include  consideration  of  the  type  of  power  source,  battery  life, 
weight/bulk,  sensitivity,  accuracy,  reliability,  noise  and  light  discipline,  operational 
security,  operational  environment  (sand/dust,  rain,  terrain,  urban,  etc.),  durability, 
and  maintainability. 

In addition, the system should be simple to operate; modular in design for specific
application to various missions or tasks; designed to deny use by enemy forces against
friendly forces; able to support continuous-operations missions lasting 72-96 h;
man-portable; air-droppable; and supported by decontamination procedures for
nuclear, biological, and chemical environments (i.e., disposable vs. able to be
decontaminated).

Table  2  lists  primary  considerations  for  gesture  technology  type  that  are  relevant  to 
use  by  Soldiers  in  operational  contexts.  The  positive  characteristics  (Pros)  and 
negative  characteristics  (Cons)  of  2  types  of  gesture  technology,  camera-based 
systems  and  instrumented  systems,  are  presented.  Throughout  the  table,  positive 
characteristics  (Pros)  for  Soldier  use  are  presented  in  plain  font,  while  negative 
characteristics  (Cons)  for  Soldier  use  are  presented  in  italics.

Table 2  Considerations for gesture technology relevant to Soldier performance

Consideration: Need for line of sight

  Camera-based system (Pros/Cons)
  • Camera cannot capture and interpret command.
  • If camera is too close, then it may miss part of the movement intended for
    command/control.
  • Trees, bushes, obstacles can severely restrict line of sight between operator
    and robot camera.
  • Operator must remain exposed in hostile environment to control robot.

  Instrumented systems (Pros/Cons)
  • Gesture will be interpreted regardless of intermittent line of sight.
  • Operator may be able to control robot from a covered or defilade position.
  • Operator may be restricted based on dense foliage from executing proper gesture.

Consideration: Distance between robot and user

  Camera-based system (Pros/Cons)
  • Camera may require close range for effective gesture recognition even when
    signaler is within line of sight.

  Instrumented systems (Pros/Cons)
  • Dependent on network range capability (see also Wireless Transmissions below).

Consideration: Need for visibility during low-visibility operations

  Camera-based system (Pros/Cons)
  • May be addressed through specialized vision systems.
  • Daylight camera systems may not capture and interpret command in low visibility.

  Instrumented systems (Pros/Cons)
  • Instrumented systems not affected by level of visibility.

Consideration: Need for wearable markers

  Camera-based system (Pros/Cons)
  • May be required for effective use in cluttered environments.
  • May enable targeting of operator, reducing safety and effectiveness.

  Instrumented systems (Pros/Cons)
  • Not needed.

Consideration: Need for wireless transmissions

  Camera-based system (Pros/Cons)
  • Not needed.

  Instrumented systems (Pros/Cons)
  • Depending on system, may be susceptible to interference/jamming.

Consideration: Need for deictic (pointing) information

  Camera-based system (Pros/Cons)
  • May be augmented by speech, visual display, or pointing device (e.g.,
    laser-pointing system).
  • 2-D camera systems have difficulty interpreting 3-D cue information.

  Instrumented systems (Pros/Cons)
  • May be augmented by speech, visual display, or pointing device (e.g.,
    laser-pointing system).
  • Pointing system may be incorporated in wearable or handheld systems.


Approved  for  public  release;  distribution  is  unlimited. 



Table 2  Considerations for gesture technology relevant to Soldier performance (continued)

Consideration (applies to BOTH camera and instrumented systems)

Ease of producing and sustaining gestures
  • Frequency and duration of gestures may impact fatigue and ability to hold
    gesture for some period of time.

Number of gestures
  • The number and distinguishability of gestures will greatly affect the ease of
    use by Soldier.

Number of robots to be controlled by one person
  • Careful consideration needs to be made for multirobot systems.

Number of coordinating Soldier users
  • For multirobot, multi-Soldier user scenarios, need careful attention to robot
    responses to gestures and policies for squad-level use.

Combat readiness
  • Important factors for any Soldier system, including gesture technology, include
    power source, battery life, weight/bulk, operational security, operational
    environment, etc.
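The technology-specific rows of Table 2 can be read as a simple screening checklist. The sketch below encodes three of those rows as an illustration; the function name, the mission flags, and the wording of each concern are assumptions made here, and a real selection would also weigh the remaining rows and the combat readiness factors.

```python
def screen_technologies(line_of_sight_assured, low_visibility, emissions_restricted):
    """Collect Table 2 concerns for each technology type, given three
    yes/no statements about the mission."""
    concerns = {"camera-based": [], "instrumented": []}
    if not line_of_sight_assured:
        concerns["camera-based"].append("cannot interpret gestures without line of sight")
    if low_visibility:
        concerns["camera-based"].append("needs specialized (night/thermal) vision")
    if emissions_restricted:
        concerns["instrumented"].append("depends on wireless transmissions")
    return concerns

# Night operation, robot may move out of view, no emissions restrictions.
result = screen_technologies(
    line_of_sight_assured=False, low_visibility=True, emissions_restricted=False
)
```

For this hypothetical mission the camera-based option accumulates two concerns and the instrumented option none, while a mission requiring electronic silence would reverse the picture.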

4.2  Future  Directions 

4.2.1  Integration  with  Speech 

Given the relatively hands-free nature of gestural commands and constraints with
regard to command vocabulary and syntax, it is likely that a combination with
speech control would be beneficial. While control systems should be separable as
an option (e.g., in circumstances where minimal sound is required, or one
component is not working), it is clear that integration of multiple modalities will
benefit from respective advantages. The use of deictic gestures to support
human-human interactions has been well established (Urban and Bajcsy 2005).
Most efforts follow the structure of the “put-that-there” system of Bolt (1980),
which refers to objects and locations by pointing and speaking. Research from
neuroscience has argued that human gesture production is very much associated
with language processing (Kelly et al. 2009; Mayberry and Jacques 2000) and that
the two are thus an intuitive pairing. Consistent with this view, Gullberg (1999)
reported that over 50% of gestures used in spontaneous gesture production were to
clarify speech utterances. Studies have shown demonstrable benefits from use of
gestures to support deictic information, particularly with regard to location and
spatial relationships (St Clair et al. 2011b).
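The "put-that-there" pattern can be sketched as a time-alignment problem: each deictic word in the utterance is paired with the pointing event that occurred closest to it in time. The code below is a simplified illustration and not a reconstruction of Bolt's system; the word list, the time window, and the event format are assumptions.

```python
DEICTIC_WORDS = {"this", "that", "here", "there"}

def resolve_deictics(words, pointing_events, max_gap_s=1.0):
    """Replace each deictic word with the referent of the nearest-in-time
    pointing event, or None if no event falls within max_gap_s.

    words:           list of (time_s, word)
    pointing_events: list of (time_s, referent)
    """
    resolved = []
    for t_word, word in words:
        if word not in DEICTIC_WORDS:
            resolved.append(word)
            continue
        nearest = min(pointing_events, key=lambda e: abs(e[0] - t_word), default=None)
        if nearest is not None and abs(nearest[0] - t_word) <= max_gap_s:
            resolved.append(nearest[1])
        else:
            resolved.append(None)  # unresolved: the system should ask again
    return resolved

utterance = [(0.0, "put"), (0.4, "that"), (1.2, "there")]
pointing = [(0.5, "ammo_pack"), (1.3, "grid_B4")]
command = resolve_deictics(utterance, pointing)
```

With these hypothetical inputs the command resolves to "put", "ammo_pack", "grid_B4". Returning None for an unmatched deictic word keeps the failure explicit, so that the system can request clarification instead of guessing.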

Speech-based  controls  have  been  developed  with  the  goal  of  natural  language 
interaction  (Brooks  et  al.  2012).  At  this  time,  purely  speech-based  controls  face  a 
core  challenge  regarding  communication  of  spatial  relationships  and  explicit 
directions  (e.g.,  “go  to  the  east  side  of  the  third  building  behind  the  church”),  and 
pointing  gestures  are  expected  to  help  clarify  localization  information.  The 
challenge  becomes  even  greater  if  one  is  commanding  a  robot  with  multiple  degrees
of  freedom  in  space;  neither  speech  nor  any  programming  language  is  well  suited
for  these  types  of  commands  (Hirzinger  2001).  Speech-based  controls  can  also  be 
problematic  in  noisy  environments  such  as  aircraft  carriers.  The  addition  of  gestures 
to  speech  control  systems  should  result  in  higher  effectiveness  based  on 
complementary  capabilities. 

Marge et al. (2012) used a WoZ approach to investigate the relative advantages of a
traditional handheld robot controller with video feed against alternatives that used
speech or a combination of speech and gesture. The evaluation was based on a
Soldier room-clearing mission, where the ground robot played the role of a fellow
Soldier who guards the hallway and watches for enemy movement. Twenty-one of
30 participants were active duty in the military. In the traditional handheld display
condition, the participant used a tablet computer with a virtual onscreen joystick,
which also provided a video feed from the robot’s camera. The participant was
tasked to move and search rooms using a predefined route, look for and note certain
objects in the environment, and monitor the video for passing people (indicated by
a marker placed in front of the camera). In the speech condition, the video feed was
replaced by a simulated capability where the robot alerts the participant through
speech. In the combined speech and gesture condition, speech again replaced the
video feed, and teleoperation of the robot was performed by a hand and arm
gesture to stop and start robot “follow” movements. Actual commands were
accomplished through experimenter control of robot movements, allowing a more
controlled investigation of options independent of other performance factors (e.g.,
capability and reliability of speech and gesture control). Results indicated
significantly faster speed and lower workload with the speech-gesture
combination, particularly for the military participants, who had less
experience with robot controllers. Other participants were the robot developers and
technicians who had more experience with the handheld controller. Participants
expressed higher preference for the speech-gesture combination, as it allows
heads-up, hands-free control, freeing more attentional resources for their
surroundings. While not conclusive, given the WoZ approach, results are promising
and support further development of actual capability.

Sofge  and  colleagues  discuss  their  application  of  2  cognitive  architectures  (i.e., 
ACT-R  and  Polyscheme)  to  achieve  understanding  of  natural  language  commands 
that  involve  spatial  reasoning  and  perspective-taking  (Sofge  et  al.  2003b).  Related 
efforts  are  also  focused  on  the  modeling  of  spatial  reference  in  natural  language 
(Adams  and  Skubic  2005;  Blisard  and  Skubic  2005;  Luke  et  al.  2005),  and  the 
integration  of  speech  with  pointing  gestures  (St  Clair  et  al.  2011b).  Typically,  the 
pointing  gesture  is  used  in  conjunction  with  speech  to  accomplish  labeling  of 
objects  (e.g.,  “this  is  saucer1”),  where  the  pointing  gesture  assists  in  clarifying
“which”  object.  Pointing  gestures  are  also  used  to  indicate  a  location  (e.g.,  “move
50  m  over  there”  [Brooks  and  Breazeal  2006]).  Kennedy  and  Rybski  (2007) 
reported  the  integration  of  vision-based  human/limb  detection  and  tracking  with  the 
capability  for  a  human  to  teach  a  robot  how  to  do  tasks  through  speech  and  physical 
demonstration.  More  basic  research  investigates  the  natural  use  of  speech  and 
gestures  for  human-to-human  dialogs  pertaining  to  spatial  relationships  (Lucking 
et  al.  2012).  As  robots  become  more  capable  (e.g.,  intelligent,  autonomous),  the 
guidelines  pertaining  to  human-human  communications  will  become  more 
relevant.  It  is  sensible  to  conclude  that  systems  using  both  speech  and  gestures  will 
naturally  evolve  towards  more  effective  and  intuitive  robot  control  systems. 

Stiefelhagen  et  al.  (2004)  developed  a  speech-gesture  control  system  as  a  natural 
interaction  with  an  assistive  robot,  in  a  kitchen  scenario.  Speech,  head  pose,  and 
gestures  were  integrated  with  visual  recognition  technology  to  communicate 
commands  such  as  “what  is  in  the  refrigerator”,  “please  set  the  table”,  “please  turn 
off/on  the  light”,  and  “please  bring  me  a  certain  object”.  The  system  included  a 
JANUS  speech  recognition  system  developed  at  the  University  of  Karlsruhe,  3-D 
face  and  hand  tracking,  speech  synthesis,  and  a  stereo  camera  system  with  pan/tilt. 
Remote  microphones  were  used  in  lieu  of  head-mounted  alternatives,  to  minimize 
user  discomfort.  Depth  information  allowed  gesture  recognition  that  was  more 
robust  to  lighting  changes.  Head  pose  information  was  used  to  signify  direction  and
to  determine  whether  the  user  was  directing  speech  to  the  robot  as  a  command.
When  combined  with  pointing  gestures,  the  head  pose  information  increased 
accuracy  of  interpretation,  by  reducing  false  positive  error  rate  from  26%  to  13%. 
The  robot  visual  recognition  system  was  programmed  to  recognize  objects  such  as 
cups,  dishes,  forks,  knives,  spoons,  and  lamps.  Thus,  combination  of  speech,
gesture,  and  head  pose  information  allowed  interpretation  of  ambiguous  phrases
such  as  “get  me  that  fork”  or  “switch  on  that  lamp”.  However,  performance  data
for  the  system  were  sparse  and  restricted  to  performance  of  each  component  rather 
than  the  system  as  a  whole.  The  system  was  described  as  a  work  in  progress,  with 
current  goals  toward  a  more  humanoid  robot  with  2  arms.  Similarly,  Rogalla  and 
associates  (Rogalla  et  al.  2002)  reported  progress  on  integrated  vision-based
recognition  of  gestures  and  objects  with  speech-based  control  for  assistive  tasks.
In  their  approach,  emphasis  was  on  silhouette  recognition  of  hands  with  color
segmentation  of  objects,  combined  with  “ViaVoice”  speech  recognition  capability, 
to  accomplish  tasks  such  as  “take  the  cup  from  the  table  in  front  of  you”. 
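The reported benefit of combining head pose with pointing (false positives falling from 26% to 13%, that is, by half) suggests a simple gating idea: accept a detected pointing gesture only when the user is also looking roughly where the arm points. The sketch below illustrates that idea only. It is not the fusion method of Stiefelhagen et al. (2004), and the yaw-angle representation and the 30-degree tolerance are assumptions.

```python
def angular_difference_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def accept_pointing(pointing_yaw_deg, head_yaw_deg, tolerance_deg=30.0):
    """Accept a candidate pointing gesture only if the head is oriented
    within tolerance_deg of the pointing direction."""
    return angular_difference_deg(pointing_yaw_deg, head_yaw_deg) <= tolerance_deg
```

For example, a candidate gesture at 90 degrees with the head at 100 degrees would be accepted, while the same gesture with the head at 200 degrees would be rejected as a likely false positive. The tolerance trades missed commands against false detections and would need to be tuned empirically.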

While  benefits  have  been  demonstrated,  challenges  remain  with  regard  to  effective 
integration  of  speech  and  gesture,  particularly  in  multi-object  environments  in  a 
3-D  world.  These  scenarios  represent  a  complex  and  unstructured
problem  (Brooks  and  Breazeal  2006).  Similar  issues  are  faced  as  researchers  strive
to  develop  an  interface  that  integrates  head-mounted  visual  display,  speech,  and
gesture  controls  for  the  commander  as  he  or  she  is  seated  within  a  moving  command 
vehicle  (Neely  et  al.  2004).  Gestures  are  most  naturally  effective  in  situations  of 
physical  co-presence,  where  the  robot  and  the  operator  can  establish  a  joint  visual 
understanding  of  the  environment,  with  physical  and  directional  referents.  This  is 
particularly  true  if  robot  recognition  of  gestures  is  dependent  on  a  camera-based 
system.  In  a  somewhat  different  approach,  Taylor  et  al.  (2014)  used  smartphone
technology  attached  to  a  user’s  wrist  to  capture  both  speech  and  gestural
movements,  which  were  sent  to  a  remote  laptop  for  processing  and  command  of  a
robotic  mule.  It  is  clear  that  many  approaches  are  being  explored  from  different 
perspectives  to  more  fully  achieve  effective  and  naturalistic  integration  of  speech 
and  gesture. 

4.2.2  Integration  with  Handheld  Devices 

The  smartphone  is  a  core  element  of  the  Army  concept  of  operations  for  the  ground 
Soldier  (Barker  2013),  providing  the  opportunity  to  use  numerous  apps  to  support 
mission  tasks.  While  the  concept  for  Soldier  use  is  a  popular  one,  with  a  dedicated 
program  of  effort,  many  challenges  remain  to  be  addressed  to  achieve  capabilities 
typical  of  civilian  use  (Erwin  2011)  due  to  limitations  associated  with  secure  use. 
Smartphone  applications  have  incorporated  3-D  audio  and  tactile  feedback  for 
waypoint  navigation  for  US  Air  Force  Battlefield  Airmen  (Calvo  et  al.  2013)  and
for  robot  control,  incorporating  features  such  as  vision-based  recognition  and
speech  control  (Checka  2011).  The  smartphone  platform  can  also  be  used  to  support
gestural  commands.  Two  types  of  handheld  devices  are  prevalent  with  regard  to 
gesture  control.  One  would  be  the  use  of  the  device  to  assist  in  gesture  recognition, 
by  being  the  object  that  is  recognized.  The  Kinect  gaming  device  is  such  an 
example.  The  user  holds  the  device,  which  is  tracked  by  a  camera  system.  This 
approach  was  shown  to  be  easy  to  learn  for  simple  navigation  commands  to  a  robot, 
so  that  the  device  is  used  like  a  virtual  leash  (Olufs  and  Vincze  2009).  The  XWand, 
developed  by  Microsoft  Research,  is  a  wand-like  device  that  enables  the  user  to 
point  at  an  object  and  control  it  using  gestures  and/or  voice.  For  example,  lights 
and  music  can  be  turned  on  and  off  by  pointing  to  the  device  (light  switch,  music 
player)  and  saying  “lights  on”  or  “volume  up”  (Wilson  and  Shafer  2003).  The 
Nintendo  Wii  remote  controller  has  also  been  used  as  a  gestural  device  to  send  7 
different  communications,  representative  of  Army  hand  and  arm  signals,  to  a  tactile
belt  (Varcholik  and  Merlo  2008). 

Smartphone  visual  displays  can  also  be  used  to  augment  speech  or  gesture.  A  map- 
based  display  addresses  the  challenge  of  directing  a  robot  to  a  particular  location  or 
object.  The  visual  display  may  be  used  with  touch  gestures,  or  integrated  with
speech-based  specifications.  A  grid-based  map  display  allows  the  operator  to  refer
to  a  grid-based  location  when  directing  the  robot.  Alternatively,  an  object-based 
approach  is  based  on  the  labeling  of  referent  objects,  which  can  be  referred  to  from 
different  vantage  points  over  time  (Walter  et  al.  2010;  Pettitt  et  al.  2014). 
Gopalakrishnan  et  al.  (2005)  demonstrated  gains  achieved  through  integrating 
camera-based  gestures  with  visual  displays,  laser-based  localization,  and  speech 
recognition  capabilities. 

Pointing  devices  have  also  been  used  to  enhance  the  precision  of  the  gesture  to 
point  to  areas  or  objects.  Kemp  and  his  associates  (Kemp  et  al.  2008)  utilized  an 
off-the-shelf  green  laser  pointer,  integrated  with  an  omnidirectional  catadioptric 
system  (i.e.,  an  optic  combining  reflection  and  refraction,  such  as  lenses  and  curved 
mirrors)  with  a  narrowband  green  filter.  The  user  points  at  the  object  of  interest, 
which  is  located  by  the  robot.  They  reported  99.4%  accuracy  with  regard  to  the 
robot  looking  at  the  correct  object  and  estimating  its  3-D  location,  and  in  90%  of 
the  trials  the  robot  successfully  moved  to  the  object  and  picked  it  up.  Objects  were 
within  3  m.  Patel  and  Abowd  (2003)  developed  a  2-way  laser-assisted  capability 
on  a  cell  phone,  which  can  select  and  communicate  with  photosensitive  tags  placed 
in  the  environment.  This  general  approach  is  promising  in  that  it  avoids  the 
complexity  with  regard  to  intelligent  understanding  of  spatial  relationships  and 
recognition  of  objects  and/or  pointing  gestures. 

One  recommendation  with  regard  to  handheld  objects  in  general  is  to  make  the 
device  easy  to  find  if  misplaced.  In  particular,  this  can  be  very  important  for 
Soldiers  in  military  operations,  given  combat  missions  that  are  often  executed  at 
night,  by  users  under  high  stress.  A  simple  GPS  chip  within  the  device  can  indicate 
its  position  on  a  map.  In  addition,  the  device  might  emit  audio  cues  when  requested. 
In  the  same  way,  the  robot  itself  may  be  beyond  line  of  sight  and  in  need  of  retrieval. 
In  that  case,  a  tactile  belt  display  can  respond  to  GPS  signals  and  guide  the  wearer 
to  the  robot  location,  while  leaving  the  hands  and  eyes  free,  to  attend  to  weapons 
and  surroundings  (Pomranky-Hartnett  et  al.  2015). 
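Guiding a wearer toward a GPS position with a tactile belt comes down to two steps: compute the bearing from the wearer to the target, then activate the tactor that faces that direction relative to the wearer's heading. The sketch below uses the standard initial great-circle bearing formula. The belt layout (8 evenly spaced tactors, index 0 at the front, numbered clockwise) is an assumption and is not taken from Pomranky-Hartnett et al. (2015).

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees
    clockwise from true north (0 to 360)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360.0

def tactor_index(target_bearing_deg, heading_deg, n_tactors=8):
    """Index of the belt tactor facing the target, with tactor 0 at the
    wearer's front and indices increasing clockwise."""
    relative = (target_bearing_deg - heading_deg) % 360.0
    sector = 360.0 / n_tactors
    return int((relative + sector / 2) // sector) % n_tactors
```

A wearer facing north with the robot due east would feel tactor 2 (the right hip in this layout); as the wearer turns toward the robot, the active tactor moves around to the front.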

Some  characteristics  that  constrain  the  use  of  handheld  devices  and  should  be 
considered  prior  to  development  or  selection  for  use  are  as  follows: 

•  Operational  range.  The  handheld  devices  must  be  used  within  system  range. 
At  this  time,  technology  improvements  to  the  electronic  transmission 
capabilities  (for  sender  and  receiver)  will  be  needed  for  longer  distances. 

•  Latency.  Latency  of  signals  may  be  a  problem  when  operational  tempo  is 
fast  and  movements  quick.

•  Multiple  handhelds.  There  may  be  networking  limitations  on  the  integration
of  multiple  handheld  controllers. 

•  Ability  to  be  hands  free.  By  their  very  nature,  handheld  devices  are  not 
hands  free.  Even  if  stored  in  a  pocket,  a  hand  must  be  used  to  hold  the  device 
for  gesture  control. 

5.  Conclusions 


It  is  clear  that  much  progress  has  occurred  with  regard  to  development  of  gestural 
command  systems,  and  that  progress  is  ongoing.  In  addition,  there  is  a  great  variety
of  technology  approaches.  In  this  report,  we  describe  many  of  these  approaches,
and  offer  an  organizing  framework  that  can  allow  the  developer  to  more  closely 
consider  situational  context  and  task  demands  to  better  identify  the  technology  most 
suited  to  the  purpose  at  hand. 

It  is  also  clear  that  integration  with  other  modalities  (e.g.,  visual  map  displays, 
speech  control,  and  bidirectional  communication)  will  offer  a  wider  range  of 
applications  and  greater  effectiveness  with  regard  to  speed,  accuracy,  and  ease  of 
use.  While  camera-based  systems  are  currently  limited,  research  is  ongoing  to 
enable  the  camera  system  to  more  closely  approximate  the  capability  not  only  to
“see”  but  also  to  understand  and  interpret  both  gestures  and  situational  context.  At
this  time,  however,  both  camera-based  and  instrumented  systems  have  limitations 
that  must  be  considered  when  choosing  the  most  appropriate  system  for  the  task 
and  situational  demands  at  hand.  For  military  use,  generalizable  considerations
would  include  weight,  bulk,  maintainability,  power  consumption,  and  ease  of  use.
In  addition,  the  use  of  wearable  networked  systems  will  always  present  security 
issues  (Hudgens  2013).  Proof-of-concept  technology  must  address  these  issues 
before  transition  to  combat  situations.

List of Symbols, Abbreviations, and Acronyms

2-D      2-dimensional
3-D      3-dimensional
ANN      artificial neural network
ASL      American Sign Language
DARPA    Defense Advanced Research Projects Agency
DOF      degrees of freedom
DTW      dynamic time warping
EMG      electromyographic
FOV      field of view
FSM      finite state machine
GNG      Growing Neural Gas
GPS      global positioning system
HMM      hidden Markov modeling
IMU      inertial measurement unit
MOPP     Mission-oriented Protective Posture
NN       neural network
PDA      personal digital assistant
RBF      radial basis function
TTP      tactics, techniques, and procedures
UAV      unmanned aerial vehicle
WOZ      Wizard of Oz